| — | runpod | 1.1200 | — | keep | $3.50 | | 4/7/2026 |
| — | local | — | — | keep | — | | 4/7/2026 |
| — | runpod | — | — | crash | — | | 4/1/2026 |
| 8xH100 Legal Score-First TTT (3ep SGD cosine) + SLOT lr=0.005 steps=8 + EGGROLL + MuonEq RC + QK_GAIN=4.0 + INT6 GPTQ + brotli-11 + coprime loader + WARMDOWN_ITERS=4000. NEW BEST. val_bpb=1.1563 (exact 1.15630221, SLOT). TTT=1.1628, Post-EMA=1.1816. TTT contribution=-0.0188, SLOT contribution=-0.0065; EGGROLL: 16 improvements. 5879 steps in 600s, 102ms/step. Artifact 14.99MB + code ~160KB = 15.15MB (COMPLIANT). Improvement: -0.0025 vs prior best (1.1588→1.1563). TTT is the largest eval-time gain so far (see TTT sketch below). | runpod | 1.1563 | 15.1MB | keep | $10.43 | | 4/1/2026 |
| 8xH100 LAWA_ENABLED=1 (k=10, freq=100) vs EMA baseline. REGRESSION: val_bpb=1.1612 (exact 1.16123862, SLOT) vs prior best 1.1588 (+0.0024). Sliding=1.1670, Post-EMA=1.1837. SLOT contribution=-0.0058. EGGROLL: 20 improvements. 5818 steps in 600s, 103ms/step. Artifact 15.02MB + code 150KB = 15.17MB (COMPLIANT). Log line "lawa: applying LAWA averaging k=10" confirms LAWA was active. Conclusion: EMA (decay=0.997) outperforms LAWA (k=10, freq=100); revert (see EMA/LAWA sketch below). | runpod | 1.1612 | 15.2MB | keep | $7.90 | | 4/1/2026 |
| 8xH100 WARMDOWN_ITERS=2000 (vs 4000). REGRESSION: val_bpb=1.1669 (exact 1.16687216, SLOT) vs prior best 1.1588 (+0.0081). Sliding=1.1700, Post-EMA=1.1892. SLOT contribution=-0.0031 (vs prior -0.0061). EGGROLL: 15 improvements. 5829 steps in 600s. Artifact 15.88MB (COMPLIANT). Conclusion: WARMDOWN_ITERS=4000 is better — revert. | runpod | 1.1669 | 15.9MB | keep | $7.90 | | 4/1/2026 |
| 8xH100 EGGROLL fix + WARMDOWN_ITERS=4000 + VR disabled. NEW BEST. val_bpb=1.1588 (exact 1.15880324, SLOT). Post-EMA=1.1814. Sliding window=1.1649 (exact 1.16489309). SLOT contribution=-0.0061. EGGROLL: 20 improvements in 60s (2220 tested). 5863 steps in 600s, 102.35ms/step. Artifact 14.99MB + code 150KB = 15.14MB (COMPLIANT). Improvement: -0.0009 vs prior best (1.1597→1.1588). | runpod | 1.1588 | 15.1MB | keep | $7.87 | | 4/1/2026 |
| EGGROLL uint16 crash: val_tokens passed as CUDAUInt16Type to nn.Embedding in eggroll_refine(). Training completed OK (val_bpb=1.1832 at step 5828, post-EMA=1.1821); GPTQ quantization OK. Crashed in _eval_loss() at eggroll_refine line 2821. Fixed in 8ab13bd by casting val_tokens to int64 (see cast sketch below). | runpod | — | — | crash | $5.14 | | 4/1/2026 |
| 8xH100 NorMuon VR (MUON_VR=1) + MuonEq RC + QK_GAIN=4.0 + SLOT lr=0.005 steps=8 + INT6 GPTQ + brotli-11 + coprime loader. val_bpb=1.1597 (exact 1.15973874). Post-EMA=1.1820. Sliding window=1.1646 (exact 1.16464033). SLOT contribution=-0.0049. VR contribution: -0.0002 vs prior best (1.1599→1.1597). Marginal — VR local signal (-0.022) did not translate at H100 scale. 5823 steps in 600s. Artifact 15.15MB + code 143KB = 15.30MB (COMPLIANT). | runpod | 1.1597 | 15.3MB | keep | $7.53 | | 4/1/2026 |
| NorMuon VR CRASH: VR buffer shape mismatch during warmup. VR bufs sized for shard shape (padded_B/8) but warmup uses non-sharded path with full bank shape. RuntimeError at line 656. Fixed in c1fbd65 by gating VR on sharded=True. | runpod | — | — | crash | $1.37 | | 4/1/2026 |
| NorMuon VR (Adafactor-style variance reduction AFTER NS5): MUON_VR=1, MUON_VR_BETA2=0.95 (see VR sketch below). val_bpb 9.349 vs MuonEq baseline 9.371 (-0.022 improvement). Train loss 6.78→6.44. EMA applied. 758s (500 iters). VR redistributes per-row/col update energy; complementary to MuonEq RC (which equilibrates BEFORE NS5). Positive directional signal for H100. | local | 9.3488 | 0.0MB | keep | — | | 4/1/2026 |
| 8xH100 MuonEq RC equilibration + QK_GAIN=4.0 + SLOT lr=0.005 steps=8 + INT6 GPTQ + brotli-11 + coprime loader. NEW BEST. val_bpb=1.1599 (exact 1.15986217). Post-EMA=1.1823. Sliding window=1.1650 (exact 1.16503774). SLOT contribution=-0.0051. MuonEq contribution: -0.0036 vs previous best (1.1635→1.1599). 5850 steps in 600s, 102.60ms/step. Artifact 15.15MB + code 141KB = 15.29MB (COMPLIANT). MuonEq adds row/col normalization before NS5 for better spectral conditioning. | runpod | 1.1599 | 15.3MB | keep | $7.52 | | 4/1/2026 |
| 8xH100 QK_GAIN_INIT=4.0 + SLOT lr=0.005 steps=8 + INT6 GPTQ + brotli-11 + coprime loader. val_bpb=1.1635 (exact 1.16349930). Post-EMA=1.1857. Sliding window=1.1683. SLOT contribution=-0.0048 (vs -0.0018 at lr=0.003/steps=5). QK_GAIN=4.0 contribution ~-0.001. 5878 steps in 600s. Artifact 15.09MB (COMPLIANT). Best result so far. | runpod | 1.1635 | 15.1MB | keep | $7.52 | | 4/1/2026 |
| 8xH100 INT6 GPTQ (clip_range=31) + brotli-11 + SLOT (lr=0.003 steps=5). val_bpb=1.1677 (exact 1.16769675) vs baseline 1.1695 = -0.0018 improvement from SLOT. SLOT eval took 251s. Training: 5851 steps in 600s. Artifact 15.09MB (COMPLIANT). SLOT gain disappointing (-0.0018 vs expected -0.020). The delta is [bsz,1,dim] (see SLOT sketch below); may need a per-position delta or different lr/steps. Next: try per-position delta or wallclock-aware warmdown. | runpod | 1.1677 | 15.1MB | keep | $7.36 | | 4/1/2026 |
| 8xH100 INT6 GPTQ (clip_range=31) + brotli-11 + P2 DISABLED + 80 shards. FIRST COMPLIANT RUN. val_bpb=1.1695 (exact 1.16954732). Post-EMA=1.1868. 5856 steps in 600s. Artifact 14.95MB + code 131KB = 15.08MB (UNDER 16MB). Improvement: -0.001 vs prior baseline (1.1705→1.1695) AND artifact now compliant. INT6+brotli: 15.08MB vs INT5+brotli 11.66MB vs INT6+zstd 16.42MB. Next: add SLOT for an expected -0.020 to -0.029 bpb. | runpod | 1.1695 | 15.1MB | keep | $5.99 | | 4/1/2026 |
| 8xH100 BASELINE with P2 DISABLED + INT5 GPTQ (clip_range=15) + brotli-11 + 80 shards. val_bpb=1.1912 (exact 1.19124857). Post-EMA=1.1867. 5849 steps in 600s, 102ms/step. Artifact 11.53MB + code 131KB = 11.66MB (4.3MB UNDER 16MB budget). GPTQ degradation only +0.0045 (1.1867→1.1912). Regression: +0.021 bpb vs prior baseline (1.1705→1.1912), though prior had P2 enabled accidentally and only 32 shards, so not directly comparable. Clean compliant run. Next: try INT6 GPTQ for better quality, or add SLOT/TTT. | runpod | 1.1912 | 11.7MB | keep | $5.87 | | 4/1/2026 |
| 1-GPU run (RUNPOD_GPU_COUNT=1 in .env, dotenv override=True). Only 854/20000 steps in 600s (batch=786K tokens, grad_accum=8, 703ms/step). Pre-GPTQ val_bpb=1.3879. INT5+brotli=3.55MB. Post-GPTQ val_bpb=5.68 (undertrained model destroyed by quantization). Fixed .env to RUNPOD_GPU_COUNT=8. | runpod | 5.6755 | 3.7MB | discard | $1.13 | | 4/1/2026 |
| 1-GPU run: only 851/20000 steps (batch too large for 1 GPU — 786K tokens, grad_accum=8, 705ms/step). Pre-GPTQ val_bpb=1.3888. GPTQ int5+brotli compressed to 3.54MB but catastrophically degraded undertrained model to val_bpb=5.79. Root cause: RUNPOD_GPU_COUNT=1 with 8xH100 batch size. Need 8 GPUs or smaller batch. | runpod | 5.7898 | 3.7MB | discard | $1.11 | | 4/1/2026 |
| — | runpod | — | — | crash | $5.30 | | 4/1/2026 |
| P2 focal loss fix (mean not weighted-mean) + brotli-11 + INT5 GPTQ. val_bpb=1.2377 (REGRESSION from 1.1705 baseline, +0.067). P2 loss CONFIRMED HARMFUL at H100 scale — downweights confident tokens, reduces effective gradient. 5879 steps in 600s. Artifact 11.63MB (under budget). Brotli infra works. CONCLUSION: disable P2 loss for next run. | runpod | 1.2377 | 11.6MB | keep | $5.99 | | 4/1/2026 |
| Brotli-11 compression + P2 focal loss. val_bpb=1.2424 (REGRESSION from 1.1705 baseline, +0.072). Artifact 11.63MB (well under 16MB — brotli saved 4.8MB vs zstd). P2 loss normalization bug: w.sum() division dampened gradients. 5845 steps in 600s. Post-EMA 1.2327. Brotli infra works; P2 impl needs fix (use mean not weighted-mean). | runpod | 1.2424 | 11.6MB | keep | $5.99 | | 4/1/2026 |
| H100 BASELINE SUCCESS: Rascal PR #1120 + HF data download fix. val_bpb=1.1705 (sliding window stride=64). Post-EMA 1.1876. 5825 steps in 601s. GPTQ int6+zstd=16.29MB + code 131KB = 16.42MB total (420KB OVER 16MB limit). Fix: minify code or reduce params. First clean H100 run. | runpod | 1.1705 | 16.4MB | keep | $5.30 | | 4/1/2026 |
| Val data not found: FileNotFoundError fineweb_val_*.bin. data/datasets/ dir empty on pod. Fixed by adding _ensure_data() HF download fallback to train_gpt.py. | runpod | — | — | crash | $0.52 | | 4/1/2026 |
| Tokenizer not found: OSError ./data/tokenizers/fineweb_1024_bpe.model. Data sync infra failed to download tokenizer. Fixed by adding _ensure_data() HF download fallback to train_gpt.py. | runpod | — | — | crash | $0.55 | | 4/1/2026 |
| — | runpod | — | — | crash | — | | 4/1/2026 |
| H100 Cycle: SSH probe failed (5/5 attempts) on pod ktn8s2tczju3p1. Second consecutive SSH failure. Transient RunPod infra issue. Code verified OK (SHA match, syntax OK, constraints feasible). Retry immediately. | runpod | — | — | crash | — | | 4/1/2026 |
| H100 Cycle: SSH probe failed (5/5 attempts) on pod 34o0wenduj5tre. Pod came up RUNNING but SSH never connected. Transient infra issue — code verified OK (SHA match, syntax OK, constraints feasible). Retry next cycle. | runpod | — | — | crash | — | | 4/1/2026 |
| H100 promotion of unmodified Rascal PR #1120 SOTA (2477 lines, readable). Rate-limited (<1h since last run). Will retry next cycle. | local | — | — | rate_limited | — | | 4/1/2026 |
| H100 promotion of unmodified PR #1089 SOTA (obfuscated 24KB). Rate-limited (<1h since last run). Will retry next cycle. | local | — | — | rate_limited | — | | 4/1/2026 |
| — | runpod | — | — | crash | — | | 4/1/2026 |
| — | runpod | — | — | crash | $4.81 | | 4/1/2026 |
| P2 focal loss (PR #1180): (1-p)^2 * CE weighting (see P2 sketch below). 500 iters, MuonEq+NS5+QK_GAIN=4.0+XSA-all+EngramLite+LEAKY=0.75. val_bpb 9.351 vs MuonEq baseline 9.371 (-0.020). Train loss 6.75→6.41. Clear positive signal at local scale. | local | 9.3512 | 0.0MB | keep | — | | 4/1/2026 |
| H100 Cycle 25: training OK (val_bpb 1.1830 at step 5830, EGGROLL 34 improvements), but TTT timed out: the 1800s SSH limit was exceeded (HF download ~327s + GPTQ 38s + EGGROLL 23s ate the budget). zstd STILL failing (externally-managed env). Artifact 16.79MB > 16MB (zlib fallback). Fixes: pip install with --break-system-packages; SSH timeout 1800→2400s. | runpod | 1.1830 | 16.8MB | crash | $10.02 | | 4/1/2026 |
| H100 Cycle 24: EGGROLL crash (local_tokens=1024 < seq_len=2048 due to world_size div bug) + HF download 404 (wrong subfolder) + zstd silent fail. val_bpb at step 4000 was 1.2821 before crash. 3 bugs fixed in commit 9511961. | runpod | — | — | crash | $4.16 | | 4/1/2026 |
| H100 Cycle 23 baseline: NS5+MuonEq+LEAKY=0.75+XSA-all+EngramLite+GPTQ-zlib (TTT disabled). Artifact 188KB over 16MB limit — zstandard not installed on pod (fell back to zlib). Fix: install zstandard before torchrun. | runpod | 1.3500 | 17.0MB | keep | $4.72 | | 4/1/2026 |
| Port MuonEq RC + TTT-30 cosine to train_gpt.py; fix run.log; H100 run b80e17d triggered (runpod_b80e17d_*) | local | — | — | infra | — | | 4/1/2026 |
| GPTQ implementation: full-Hessian INT6 + zstd-22 added to sota_1120_rascal_train_gpt.py and train_gpt.py. Unblocks Tier 2. Calibration via forward hooks (h_attn_in/out/mlp_in/act), Cholesky H_inv, actorder, 5-way clip sweep, zstd-22 compression (see packing sketch below). | local | — | — | infra | — | | 4/1/2026 |
| TTT-3 smoke test: TTT_EPOCHS=3, TTT_CHUNK_TOKENS=32768, MAX_VAL_TOKENS=131072. val_bpb 9.380 vs 9.371 MuonEq baseline — neutral at 500 steps (expected: undertrained model has too-high loss for TTT signal). TTT takes 108.9s for 4 chunks (validated impl). H100 scale (7000 steps, well-trained) should show full -0.041 bpb gain from 30 epochs. | local | 9.3800 | 0.0MB | keep | — | research_results.jsonl | 4/1/2026 |
| MuonEq RC equilibration before NS5 (arxiv:2603.28254): per-row then per-col normalization of the gradient matrix before NS iterations (see MuonEq sketch below). val_bpb 9.371 vs NS5 baseline 9.474 (-0.103). Train loss 6.76→6.38. 2.75s/step. Novel technique not yet on leaderboard. Closes gap vs Turbo-Muon (9.354) while keeping the standard 5-iteration NS5 that works at H100 scale. | local | 9.3708 | 0.0MB | keep | — | research_results.jsonl | 4/1/2026 |
| ResidLambdas: x0_lambda init=0.1 (vs resid_mix init=0.0). val_bpb 9.506 vs 9.474 baseline (+0.031, worse). Scale-dependent: PR #1130 uses it at H100 scale (7000+ steps); at 500 steps the x0 injection adds noise before the model can benefit. Reverted. Note: attn_scale/mlp_scale (init=1.0) already implement the resid_lambda part; only the x0 injection was new. | local | 9.5056 | 0.0MB | discard | — | | 4/1/2026 |
| Standard NS5 Muon (quintic, a=15/8, b=-5/4, c=3/8; see NS5 sketch below) + LEAKY_SLOPE=0.75. 500 iters. val_bpb 9.474 vs 9.354 Turbo-Muon baseline: expected regression at 500 steps (Turbo-Muon converges faster short-term). H100-scale evidence agrees: PR #1105 confirmed Turbo-Muon +0.0018 BPB worse at 7000+ steps. NS5 now canonical for H100 runs. | local | 9.4739 | 0.0MB | keep | — | | 4/1/2026 |
| Coprime loader v2 (correct stride=200001≈n/500; see loader sketch below). Neutral local result (9.356 vs 9.354 baseline, within noise). Train loss 6.77→6.48, same as sequential. Expected: no local improvement (100M shard too large for 500-step coverage). Ready for H100 deployment, where full shard coverage matters. | local | 9.3555 | 0.0MB | keep | — | program.md | 4/1/2026 |
| Coprime loader v1 (BAD stride=n/2): alternates between 2 positions only. Phase=53M hit low-entropy data (train loss 4.69 vs normal 6.73). val_bpb 9.919 WORSE than baseline. Bug: stride should be n//total_steps not n//2. | local | 9.9186 | 0.0MB | discard | — | | 4/1/2026 |
| LEAKY_SLOPE=0.75 (vs 0.3). Turbo-Muon unchanged. 500 iters. val_bpb 9.354 vs 9.366 (-0.012). Train loss 6.73→6.48. PR #1135 uses 0.75; positive direction confirmed. | local | 9.3537 | 0.0MB | keep | — | program.md | 4/1/2026 |
| Turbo-Muon (Polar Express 4-iter + AOL preconditioning). 500 iters. val_bpb 9.366 vs 9.623 (-0.257). Train loss 6.73→6.49. BUT 1233s run time (>10min limit) due to EMA eval every step. Fix needed. | local | 9.3661 | 0.0MB | keep | — | research_results.jsonl | 4/1/2026 |
| Muon (NS5 lr=0.025) + QK_GAIN=4.0 + LEAKY_SLOPE=0.3. 200 iters. Loss 6.75→6.63 (learning signal!), AdamW was flat at 6.93. val_bpb 9.623 vs 10.007 baseline (-0.384 improvement). 2.62s/step. Clear Muon win. | local | 9.6233 | 0.0MB | keep | — | | 4/1/2026 |
| Baseline: AdamW 3e-4, XSA-all, EngramLite, 200 iters, 30.7M params. Loss stuck at random init (1.6M tokens insufficient to show learning). Establishes timing: 2s/step on M4. | local | 10.0074 | 0.0MB | keep | — | | 4/1/2026 |
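
Reference sketches for the techniques logged above follow; each is a minimal reconstruction under stated assumptions, not the repo's exact code. First, the quintic Newton-Schulz orthogonalization behind the NS5 entries, using the logged a=15/8, b=-5/4, c=3/8 coefficients; the Frobenius pre-normalization and orientation handling are assumptions.

```python
import torch

def ns5_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    # Quintic Newton-Schulz: X <- a*X + (b*A + c*A@A) @ X with A = X@X.T.
    # Each singular value s maps to (15s - 10s^3 + 3s^5)/8, so repeated
    # iterations drive all singular values toward 1 (orthogonalization).
    a, b, c = 15.0 / 8.0, -5.0 / 4.0, 3.0 / 8.0
    X = G / (G.norm() + eps)              # start with singular values <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                           # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X
```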
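
MuonEq RC equilibration ("per-row then per-col normalization of the gradient matrix before NS iterations") as a sketch; the RMS normalizer and epsilon are assumptions.

```python
import torch

def muoneq_rc(G: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Row equilibration, then column equilibration: divide each row by its
    # RMS, then each column of the result by its RMS, so NS5 starts from a
    # better-conditioned (flatter-spectrum) matrix.
    G = G / (G.pow(2).mean(dim=1, keepdim=True).sqrt() + eps)
    G = G / (G.pow(2).mean(dim=0, keepdim=True).sqrt() + eps)
    return G

# Order per the log: equilibrate BEFORE orthogonalizing.
# update = ns5_orthogonalize(muoneq_rc(momentum))
```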
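
NorMuon VR ("Adafactor-style variance reduction AFTER NS5", MUON_VR=1, MUON_VR_BETA2=0.95) as a sketch. The factored rank-1 variance estimate and the norm-preserving rescale are assumptions chosen to match "redistributes per-row/col update energy".

```python
import torch

def normuon_vr(update: torch.Tensor, row_v: torch.Tensor, col_v: torch.Tensor,
               beta2: float = 0.95, eps: float = 1e-8) -> torch.Tensor:
    # row_v: [m, 1] and col_v: [1, n] hold running means of squared updates.
    row_v.mul_(beta2).add_(update.pow(2).mean(dim=1, keepdim=True), alpha=1 - beta2)
    col_v.mul_(beta2).add_(update.pow(2).mean(dim=0, keepdim=True), alpha=1 - beta2)
    v_hat = (row_v @ col_v) / (row_v.mean() + eps)  # rank-1 variance estimate
    scaled = update / (v_hat.sqrt() + eps)
    # Preserve the overall update norm: only the row/col energy distribution
    # changes, not the step size.
    return scaled * (update.norm() / (scaled.norm() + eps))
```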
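
The P2 focal loss from PR #1180 ((1-p)^2 * CE) with the normalization fix from the 4/1 entries; a sketch assuming per-token CE, with p the probability assigned to the true token.

```python
import torch
import torch.nn.functional as F

def p2_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    ce = F.cross_entropy(logits, targets, reduction="none")
    p = ce.neg().exp()                 # probability assigned to the true token
    w = (1.0 - p).pow(2)               # downweights confident tokens
    # Buggy variant normalized by w.sum() (a weighted mean), which the log
    # blames for dampened gradients; the fix is a plain mean over tokens.
    return (w * ce).mean()
```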
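
The coprime loader, per the v1/v2 entries: stride ≈ n//total_steps, forced coprime so offsets cycle through the whole shard, where the v1 bug (stride=n//2) alternated between two positions. A sketch; the exact modulus is an assumption.

```python
from math import gcd

def coprime_offsets(n_tokens: int, seq_len: int, total_steps: int):
    # Valid start positions live in [0, period). A stride coprime to the
    # period visits total_steps distinct positions spread across the shard;
    # stride = n // 2 (the v1 bug) revisits just two of them.
    period = n_tokens - seq_len
    stride = max(n_tokens // total_steps, 1)
    while gcd(stride, period) != 1:
        stride += 1                    # bump until coprime -> full cycle
    pos = 0
    for _ in range(total_steps):
        yield pos
        pos = (pos + stride) % period
```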
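
The two weight-averaging schemes compared on 4/1: EMA with decay 0.997 (the winner) versus LAWA averaging the last k=10 snapshots taken every 100 steps. Sketch; parameter handling is simplified.

```python
import torch

@torch.no_grad()
def ema_update(ema_params, params, decay: float = 0.997):
    # Winner per the log: exponential moving average, updated every step.
    for e, p in zip(ema_params, params):
        e.mul_(decay).add_(p, alpha=1.0 - decay)

class Lawa:
    # LAWA: uniform average of the last k checkpoints, snapshotted every
    # `freq` steps. Regressed +0.0024 bpb vs the EMA above.
    def __init__(self, k: int = 10, freq: int = 100):
        self.k, self.freq, self.buf = k, freq, []

    @torch.no_grad()
    def maybe_snapshot(self, step: int, params):
        if step % self.freq == 0:
            self.buf.append([p.detach().clone() for p in params])
            self.buf = self.buf[-self.k:]

    def averaged(self):
        return [torch.stack(ps).mean(dim=0) for ps in zip(*self.buf)]
```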
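
Score-first TTT as a sketch: each val chunk is scored before the model has ever trained on it, then the model adapts on that chunk for 3 SGD epochs under cosine decay before moving on. The lr, the model.loss interface, the per-chunk epoch placement, and the per-token (rather than per-byte) bit accounting are all assumptions.

```python
import math
import torch

def ttt_score_first(model, chunks, epochs: int = 3, lr: float = 1e-4) -> float:
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(
        opt, T_max=epochs * len(chunks))
    bits, toks = 0.0, 0
    for chunk in chunks:                    # chunk: [1, seq] token ids
        with torch.no_grad():               # score FIRST: the model has not
            nll = model.loss(chunk)         # trained on this chunk yet
        bits += nll.item() / math.log(2) * chunk.numel()
        toks += chunk.numel()
        for _ in range(epochs):             # then adapt (3ep SGD cosine)
            model.loss(chunk).backward()
            opt.step(); opt.zero_grad(); sched.step()
    return bits / toks                      # per-token bits; real bpb is per byte
```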
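
SLOT at lr=0.005, steps=8: optimize a per-sequence additive delta of shape [bsz, 1, dim] against the LM loss, then score with it applied. The `hidden_delta` kwarg and `model.dim`/`model.loss` interfaces are hypothetical stand-ins for however the delta is injected into the forward pass.

```python
import torch

def slot_score(model, tokens: torch.Tensor, lr: float = 0.005, steps: int = 8):
    # One broadcastable delta per sequence, optimized by SGD on the same
    # loss that will be reported, then used for the final scoring pass.
    delta = torch.zeros(tokens.shape[0], 1, model.dim,
                        device=tokens.device, requires_grad=True)
    opt = torch.optim.SGD([delta], lr=lr)
    for _ in range(steps):
        loss = model.loss(tokens, hidden_delta=delta)  # hypothetical kwarg
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        return model.loss(tokens, hidden_delta=delta)
```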
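
The artifact packing direction (INT6 with clip_range=31, brotli quality 11) as a storage-format sketch only; the logged pipeline is full GPTQ (Hessian calibration, Cholesky H_inv, actorder, clip sweep), which error-compensates the rounding and is omitted here.

```python
import brotli
import torch

def pack_int6_brotli(weights, clip_range: int = 31) -> bytes:
    # weights: dict of name -> tensor. Symmetric per-tensor quantization to
    # [-31, 31] (6-bit range), stored one int8 per value for simplicity,
    # then brotli-11 compressed. Nearest rounding here; GPTQ instead picks
    # roundings that minimize layer output error on calibration data.
    blobs = []
    for name, w in weights.items():
        scale = w.abs().max() / clip_range
        q = (w / scale).round().clamp(-clip_range, clip_range).to(torch.int8)
        blobs.append(q.cpu().numpy().tobytes())
    return brotli.compress(b"".join(blobs), quality=11)
```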
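
Finally, the 8ab13bd cast fix in spirit: nn.Embedding requires integer index tensors (long/int), so uint16 token ids from a memmapped shard must be cast before _eval_loss().

```python
import torch

def to_embedding_indices(val_tokens: torch.Tensor) -> torch.Tensor:
    # uint16 ids raise at embedding-lookup time on CUDA (the CUDAUInt16Type
    # error in the log); cast to int64 before the forward pass.
    return val_tokens.long()
```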