Val bpb over time

| Description | Tier | Val bpb | Size | Status | Cost | Source | Date |
|---|---|---|---|---|---|---|---|
| | runpod | 1.1200 | | keep | $3.50 | | 4/7/2026 |
| | local | | | keep | | | 4/7/2026 |
| | runpod | | | crash | | | 4/1/2026 |
| 8xH100 Legal Score-First TTT (3ep SGD cosine) + SLOT lr=0.005 steps=8 + EGGROLL + MuonEq RC + QK_GAIN=4.0 + INT6 GPTQ + brotli-11 + coprime loader + WARMDOWN_ITERS=4000. NEW BEST. val_bpb=1.1563 (exact 1.15630221, SLOT). TTT=1.1628, Post-EMA=1.1816. TTT contribution=-0.0188, SLOT contribution=-0.0065. EGGROLL: 16 improvements. 5879 steps in 600s, 102ms/step. Artifact 14.99MB + code ~160KB = 15.15MB (COMPLIANT). Improvement: -0.0025 vs prior best (1.1588→1.1563). TTT is the biggest eval-time gain so far. | runpod | 1.1563 | 15.1MB | keep | $10.43 | | 4/1/2026 |
| 8xH100 LAWA_ENABLED=1 (k=10, freq=100) vs EMA baseline. REGRESSION: val_bpb=1.1612 (exact 1.16123862, SLOT) vs prior best 1.1588 (+0.0024). Sliding=1.1670, Post-EMA=1.1837. SLOT contribution=-0.0058. EGGROLL: 20 improvements. 5818 steps in 600s, 103ms/step. Artifact 15.02MB + code 150KB = 15.17MB (COMPLIANT). Log: `lawa: applying LAWA averaging k=10`. Conclusion: EMA (decay=0.997) outperforms LAWA (k=10, freq=100) — revert. | runpod | 1.1612 | 15.2MB | keep | $7.90 | | 4/1/2026 |
| 8xH100 WARMDOWN_ITERS=2000 (vs 4000). REGRESSION: val_bpb=1.1669 (exact 1.16687216, SLOT) vs prior best 1.1588 (+0.0081). Sliding=1.1700, Post-EMA=1.1892. SLOT contribution=-0.0031 (vs prior -0.0061). EGGROLL: 15 improvements. 5829 steps in 600s. Artifact 15.88MB (COMPLIANT). Conclusion: WARMDOWN_ITERS=4000 is better — revert. | runpod | 1.1669 | 15.9MB | keep | $7.90 | | 4/1/2026 |
| 8xH100 EGGROLL fix + WARMDOWN_ITERS=4000 + VR disabled. NEW BEST. val_bpb=1.1588 (exact 1.15880324, SLOT). Post-EMA=1.1814. Sliding window=1.1649 (exact 1.16489309). SLOT contribution=-0.0061. EGGROLL: 20 improvements in 60s (2220 tested). 5863 steps in 600s, 102.35ms/step. Artifact 14.99MB + code 150KB = 15.14MB (COMPLIANT). Improvement: -0.0009 vs prior best (1.1597→1.1588). | runpod | 1.1588 | 15.1MB | keep | $7.87 | | 4/1/2026 |
| EGGROLL uint16 crash: val_tokens passed as CUDAUInt16Type to nn.Embedding in eggroll_refine(). Training completed OK (val_bpb=1.1832 at step 5828, post-EMA=1.1821). GPTQ quantized OK. Crashed at eggroll_refine line 2821 in _eval_loss(). Fixed in 8ab13bd by casting val_tokens to int64. | runpod | | | crash | $5.14 | | 4/1/2026 |
| 8xH100 NorMuon VR (MUON_VR=1) + MuonEq RC + QK_GAIN=4.0 + SLOT lr=0.005 steps=8 + INT6 GPTQ + brotli-11 + coprime loader. val_bpb=1.1597 (exact 1.15973874). Post-EMA=1.1820. Sliding window=1.1646 (exact 1.16464033). SLOT contribution=-0.0049. VR contribution: -0.0002 vs prior best (1.1599→1.1597). Marginal — the VR local signal (-0.022) did not translate at H100 scale. 5823 steps in 600s. Artifact 15.15MB + code 143KB = 15.30MB (COMPLIANT). | runpod | 1.1597 | 15.3MB | keep | $7.53 | | 4/1/2026 |
| NorMuon VR CRASH: VR buffer shape mismatch during warmup. VR buffers sized for the shard shape (padded_B/8), but warmup uses the non-sharded path with the full bank shape. RuntimeError at line 656. Fixed in c1fbd65 by gating VR on sharded=True. | runpod | | | crash | $1.37 | | 4/1/2026 |
| NorMuon VR (Adafactor-style variance reduction AFTER NS5): MUON_VR=1, MUON_VR_BETA2=0.95. val_bpb 9.349 vs MuonEq baseline 9.371 (-0.022 improvement). Train loss 6.78→6.44. EMA applied. 758s (500 iters). VR redistributes per-row/col update energy; complementary to MuonEq RC (which equilibrates BEFORE NS5). Positive directional signal for H100. | local | 9.3488 | 0.0MB | keep | | | 4/1/2026 |
| 8xH100 MuonEq RC equilibration + QK_GAIN=4.0 + SLOT lr=0.005 steps=8 + INT6 GPTQ + brotli-11 + coprime loader. NEW BEST. val_bpb=1.1599 (exact 1.15986217). Post-EMA=1.1823. Sliding window=1.1650 (exact 1.16503774). SLOT contribution=-0.0051. MuonEq contribution: -0.0036 vs previous best (1.1635→1.1599). 5850 steps in 600s, 102.60ms/step. Artifact 15.15MB + code 141KB = 15.29MB (COMPLIANT). MuonEq adds row/col normalization before NS5 for better spectral conditioning. | runpod | 1.1599 | 15.3MB | keep | $7.52 | | 4/1/2026 |
| 8xH100 QK_GAIN_INIT=4.0 + SLOT lr=0.005 steps=8 + INT6 GPTQ + brotli-11 + coprime loader. val_bpb=1.1635 (exact 1.16349930). Post-EMA=1.1857. Sliding window=1.1683. SLOT contribution=-0.0048 (vs -0.0018 at lr=0.003/steps=5). QK_GAIN=4.0 contribution ~-0.001. 5878 steps in 600s. Artifact 15.09MB (COMPLIANT). Best result so far. | runpod | 1.1635 | 15.1MB | keep | $7.52 | | 4/1/2026 |
| 8xH100 INT6 GPTQ (clip_range=31) + brotli-11 + SLOT (lr=0.003, steps=5). val_bpb=1.1677 (exact 1.16769675) vs baseline 1.1695 = -0.0018 improvement from SLOT. SLOT eval took 251s. Training: 5851 steps in 600s. Artifact 15.09MB (COMPLIANT). SLOT gain disappointing (-0.0018 vs expected -0.020). Delta is [bsz,1,dim] — may need per-position delta or different lr/steps. Next: try per-position delta or wallclock-aware warmdown. | runpod | 1.1677 | 15.1MB | keep | $7.36 | | 4/1/2026 |
| 8xH100 INT6 GPTQ (clip_range=31) + brotli-11 + P2 DISABLED + 80 shards. FIRST COMPLIANT RUN. val_bpb=1.1695 (exact 1.16954732). Post-EMA=1.1868. 5856 steps in 600s. Artifact 14.95MB + code 131KB = 15.08MB (UNDER 16MB). Improvement: -0.001 vs prior baseline (1.1705→1.1695), AND the artifact is now compliant. INT6+brotli: 15.08MB vs INT5+brotli 11.66MB vs INT6+zstd 16.42MB. Next: add SLOT for -0.020 to -0.029 bpb. | runpod | 1.1695 | 15.1MB | keep | $5.99 | | 4/1/2026 |
| 8xH100 BASELINE with P2 DISABLED + INT5 GPTQ (clip_range=15) + brotli-11 + 80 shards. val_bpb=1.1912 (exact 1.19124857). Post-EMA=1.1867. 5849 steps in 600s, 102ms/step. Artifact 11.53MB + code 131KB = 11.66MB (4.3MB under the 16MB budget). GPTQ degradation only +0.0045 (1.1867→1.1912). Δ vs prior baseline: +0.021 bpb (1.1705→1.1912), but the prior run had P2 enabled accidentally and used only 32 shards, so the numbers are not directly comparable. Clean compliant run. Next: try INT6 GPTQ for better quality, or add SLOT/TTT. | runpod | 1.1912 | 11.7MB | keep | $5.87 | | 4/1/2026 |
| 1-GPU run (RUNPOD_GPU_COUNT=1 in .env, dotenv override=True). Only 854/20000 steps in 600s (batch=786K tokens, grad_accum=8, 703ms/step). Pre-GPTQ val_bpb=1.3879. INT5+brotli=3.55MB. Post-GPTQ val_bpb=5.68 (undertrained model destroyed by quantization). Fixed .env to RUNPOD_GPU_COUNT=8. | runpod | 5.6755 | 3.7MB | discard | $1.13 | | 4/1/2026 |
| 1-GPU run: only 851/20000 steps (batch too large for 1 GPU — 786K tokens, grad_accum=8, 705ms/step). Pre-GPTQ val_bpb=1.3888. GPTQ int5+brotli compressed to 3.54MB but catastrophically degraded the undertrained model to val_bpb=5.79. Root cause: RUNPOD_GPU_COUNT=1 with the 8xH100 batch size. Need 8 GPUs or a smaller batch. | runpod | 5.7898 | 3.7MB | discard | $1.11 | | 4/1/2026 |
| | runpod | | | crash | $5.30 | | 4/1/2026 |
| P2 focal loss fix (mean, not weighted mean) + brotli-11 + INT5 GPTQ. val_bpb=1.2377 (REGRESSION from 1.1705 baseline, +0.067). P2 loss CONFIRMED HARMFUL at H100 scale — downweights confident tokens, reduces effective gradient. 5879 steps in 600s. Artifact 11.63MB (under budget). Brotli infra works. CONCLUSION: disable P2 loss for next run. | runpod | 1.2377 | 11.6MB | keep | $5.99 | | 4/1/2026 |
| Brotli-11 compression + P2 focal loss. val_bpb=1.2424 (REGRESSION from 1.1705 baseline, +0.072). Artifact 11.63MB (well under 16MB — brotli saved 4.8MB vs zstd). P2 loss normalization bug: w.sum() division dampened gradients. 5845 steps in 600s. Post-EMA 1.2327. Brotli infra works; the P2 implementation needs a fix (use mean, not weighted mean). | runpod | 1.2424 | 11.6MB | keep | $5.99 | | 4/1/2026 |
| H100 BASELINE SUCCESS: Rascal PR #1120 + HF data download fix. val_bpb=1.1705 (sliding window, stride=64). Post-EMA 1.1876. 5825 steps in 601s. GPTQ int6+zstd=16.29MB + code 131KB = 16.42MB total (420KB OVER the 16MB limit). Fix: minify code or reduce params. First clean H100 run. | runpod | 1.1705 | 16.4MB | keep | $5.30 | | 4/1/2026 |
| Val data not found: FileNotFoundError fineweb_val_*.bin. data/datasets/ dir empty on pod. Fixed by adding an _ensure_data() HF download fallback to train_gpt.py. | runpod | | | crash | $0.52 | | 4/1/2026 |
| Tokenizer not found: OSError ./data/tokenizers/fineweb_1024_bpe.model. Data sync infra failed to download the tokenizer. Fixed by adding an _ensure_data() HF download fallback to train_gpt.py. | runpod | | | crash | $0.55 | | 4/1/2026 |
| | runpod | | | crash | | | 4/1/2026 |
| H100 Cycle: SSH probe failed (5/5 attempts) on pod ktn8s2tczju3p1. Second consecutive SSH failure. Transient RunPod infra issue. Code verified OK (SHA match, syntax OK, constraints feasible). Retry immediately. | runpod | | | crash | | | 4/1/2026 |
| H100 Cycle: SSH probe failed (5/5 attempts) on pod 34o0wenduj5tre. Pod came up RUNNING but SSH never connected. Transient infra issue — code verified OK (SHA match, syntax OK, constraints feasible). Retry next cycle. | runpod | | | crash | | | 4/1/2026 |
| H100 promotion of unmodified Rascal PR #1120 SOTA (2477 lines, readable). Rate-limited (<1h since last run). Will retry next cycle. | local | | | rate_limited | | | 4/1/2026 |
| H100 promotion of unmodified PR #1089 SOTA (obfuscated 24KB). Rate-limited (<1h since last run). Will retry next cycle. | local | | | rate_limited | | | 4/1/2026 |
| | runpod | | | crash | | | 4/1/2026 |
| | runpod | | | crash | $4.81 | | 4/1/2026 |
| P2 focal loss (PR #1180): (1-p)^2 * CE weighting. 500 iters, MuonEq+NS5+QK_GAIN=4.0+XSA-all+EngramLite+LEAKY=0.75. val_bpb 9.351 vs MuonEq baseline 9.371 (-0.020). Train loss 6.75→6.41. Clear positive signal at local scale. | local | 9.3512 | 0.0MB | keep | | | 4/1/2026 |
| H100 Cycle 25: training OK (val_bpb 1.1830 at step 5830, +EGGROLL 34 improvements), but TTT timed out (SSH 1800s limit exceeded by HF download ~327s + GPTQ 38s + EGGROLL 23s). zstd STILL failing (externally-managed env). Artifact 16.79MB > 16MB (zlib). Fixes: --break-system-packages, timeout 1800→2400s. | runpod | 1.1830 | 16.8MB | crash | $10.02 | | 4/1/2026 |
| H100 Cycle 24: EGGROLL crash (local_tokens=1024 < seq_len=2048 due to a world_size division bug) + HF download 404 (wrong subfolder) + zstd silent fail. val_bpb at step 4000 was 1.2821 before the crash. 3 bugs fixed in commit 9511961. | runpod | | | crash | $4.16 | | 4/1/2026 |
| H100 Cycle 23 baseline: NS5+MuonEq+LEAKY=0.75+XSA-all+EngramLite+GPTQ-zlib (TTT disabled). Artifact 188KB over the 16MB limit — zstandard not installed on the pod (fell back to zlib). Fix: install zstandard before torchrun. | runpod | 1.3500 | 17.0MB | keep | $4.72 | | 4/1/2026 |
| Port MuonEq RC + TTT-30 cosine to train_gpt.py; fix run.log; H100 run b80e17d triggered (runpod_b80e17d_*). | local | | | infra | | | 4/1/2026 |
| GPTQ implementation: full-Hessian int6+zstd-22 added to sota_1120_rascal_train_gpt.py and train_gpt.py. Unblocks Tier 2. Calibration via forward hooks (h_attn_in/out/mlp_in/act), Cholesky H_inv, actorder, 5-way clip sweep, zstd-22 compression. | local | | | infra | | | 4/1/2026 |
| TTT-3 smoke test: TTT_EPOCHS=3, TTT_CHUNK_TOKENS=32768, MAX_VAL_TOKENS=131072. val_bpb 9.380 vs 9.371 MuonEq baseline — neutral at 500 steps (expected: an undertrained model has too-high loss for the TTT signal). TTT takes 108.9s for 4 chunks (validated impl). H100 scale (7000 steps, well-trained) should show the full -0.041 bpb gain from 30 epochs. | local | 9.3800 | 0.0MB | keep | | research_results.jsonl | 4/1/2026 |
| MuonEq RC equilibration before NS5 (arxiv:2603.28254): per-row then per-col normalization of the gradient matrix before the NS iterations. val_bpb 9.371 vs NS5 baseline 9.474 (-0.103). Train loss 6.76→6.38. 2.75s/step. Novel technique not yet on the leaderboard. Closes the gap vs Turbo-Muon (9.354) while keeping the standard NS5 5-iter that works at H100 scale. | local | 9.3708 | 0.0MB | keep | | research_results.jsonl | 4/1/2026 |
| ResidLambdas: x0_lambda init=0.1 (vs resid_mix init=0.0). val_bpb 9.506 vs 9.474 baseline (+0.031, worse). Scale-dependent: PR #1130 uses it at H100 scale (7000+ steps); at 500 steps the x0 injection adds noise before the model can benefit. Reverted. Note: attn_scale/mlp_scale (init=1.0) already implement the resid_lambda part; only the x0 injection was new. | local | 9.5056 | 0.0MB | discard | | | 4/1/2026 |
| Standard NS5 Muon (cubic, a=15/8, b=-5/4, c=3/8) + LEAKY_SLOPE=0.75. 500 iters. val_bpb 9.474 vs 9.354 Turbo-Muon baseline — an expected regression at 500 steps (Turbo-Muon converges faster short-term). The H100 verdict stands: PR #1105 confirmed Turbo-Muon is +0.0018 bpb worse at 7000+ steps. NS5 is now canonical for H100 runs. | local | 9.4739 | 0.0MB | keep | | | 4/1/2026 |
| Coprime loader v2 (correct stride=200001 ≈ n/500). Neutral local result (9.356 vs 9.354 baseline, within noise). Train loss 6.77→6.48, same as sequential. Expected: no local improvement (a 100M-token shard is too large for 500-step coverage). Ready for H100 deployment, where full shard coverage matters. | local | 9.3555 | 0.0MB | keep | | program.md | 4/1/2026 |
| Coprime loader v1 (BAD: stride=n/2) alternates between two positions only. Phase=53M hit low-entropy data (train loss 4.69 vs normal 6.73). val_bpb 9.919, WORSE than baseline. Bug: stride should be n // total_steps, not n // 2. | local | 9.9186 | 0.0MB | discard | | | 4/1/2026 |
| LEAKY_SLOPE=0.75 (vs 0.3). Turbo-Muon unchanged. 500 iters. val_bpb 9.354 vs 9.366 (-0.012). Train loss 6.73→6.48. PR #1135 uses 0.75; positive direction confirmed. | local | 9.3537 | 0.0MB | keep | | program.md | 4/1/2026 |
| Turbo-Muon (Polar Express 4-iter + AOL preconditioning). 500 iters. val_bpb 9.366 vs 9.623 (-0.257). Train loss 6.73→6.49. BUT the run took 1233s (>10min limit) due to EMA eval every step; fix needed. | local | 9.3661 | 0.0MB | keep | | research_results.jsonl | 4/1/2026 |
| Muon (NS5, lr=0.025) + QK_GAIN=4.0 + LEAKY_SLOPE=0.3. 200 iters. Loss 6.75→6.63 (learning signal!); AdamW was flat at 6.93. val_bpb 9.623 vs 10.007 baseline (-0.384 improvement). 2.62s/step. Clear Muon win. | local | 9.6233 | 0.0MB | keep | | | 4/1/2026 |
| Baseline: AdamW 3e-4, XSA-all, EngramLite, 200 iters, 30.7M params. Loss stuck at random init (1.6M tokens is insufficient to show learning). Establishes timing: 2s/step on M4. | local | 10.0074 | 0.0MB | keep | | | 4/1/2026 |
Parameter Golf Dashboard Last updated: 4/1/2026, 8:01:18 PM
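The best run above applies TTT with "3ep SGD cosine". The cosine part is a standard cosine learning-rate decay over the test-time training steps; a minimal sketch (my illustration, not the repo's code; the 3×100-step schedule below is invented):

```python
import math

def cosine_lr(step, total_steps, base_lr):
    """Cosine decay from base_lr at step 0 to ~0 at total_steps,
    as in schedules described as 'SGD cosine'."""
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))

# e.g. 3 epochs of 100 TTT steps each
schedule = [cosine_lr(s, 300, 0.01) for s in range(301)]
```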
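The LAWA A/B above concluded that EMA (decay=0.997) beats LAWA (k=10, freq=100). Both are weight-averaging schemes; a scalar-weight sketch of each (assuming LAWA means a uniform average of the last k checkpoints sampled every `freq` steps, which matches the log line):

```python
from collections import deque

def ema_update(avg, w, decay=0.997):
    """Exponential moving average of weights, applied every step."""
    return decay * avg + (1 - decay) * w

class Lawa:
    """LAWA: keep the last k checkpoints sampled every `freq` steps
    and average them uniformly."""
    def __init__(self, k=10, freq=100):
        self.freq = freq
        self.buf = deque(maxlen=k)

    def observe(self, step, w):
        if step % self.freq == 0:
            self.buf.append(w)

    def average(self):
        return sum(self.buf) / len(self.buf)

lawa = Lawa(k=3, freq=2)
avg = 0.0
for step, w in enumerate([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]):
    avg = ema_update(avg, w)   # EMA tracks every step
    lawa.observe(step, w)      # LAWA samples steps 0, 2, 4
```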
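WARMDOWN_ITERS=4000 vs 2000 was A/B-tested above (4000 won). The log does not show the schedule itself; assuming "warmdown" means a linear LR decay over the final N iterations, a sketch:

```python
def lr_at(step, total_steps, base_lr, warmdown_iters):
    """Constant LR, then linear decay to 0 over the last warmdown_iters steps."""
    start = total_steps - warmdown_iters
    if step < start:
        return base_lr
    frac = (total_steps - step) / warmdown_iters  # 1.0 at start of warmdown, ->0
    return base_lr * frac

lrs = [lr_at(s, total_steps=6000, base_lr=0.025, warmdown_iters=4000)
       for s in range(6000)]
```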
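The NorMuon VR entries describe "Adafactor-style variance reduction AFTER NS5" with MUON_VR_BETA2=0.95, redistributing per-row/col update energy. The exact form is not in the log; the sketch below is one plausible reading (all names mine): factored row/column second-moment EMAs, a rank-1 denominator, and an RMS-preserving rescale so the overall update magnitude is unchanged.

```python
import math

def vr_step(U, v_row, v_col, beta2=0.95, eps=1e-8):
    """Factored (Adafactor-style) variance reduction on an orthogonalized
    update U: EMA of row/column mean squares, divide U by the rank-1
    sqrt-variance estimate, then rescale to preserve the update's RMS."""
    rows, cols = len(U), len(U[0])
    for i in range(rows):
        v_row[i] = beta2 * v_row[i] + (1 - beta2) * sum(x * x for x in U[i]) / cols
    for j in range(cols):
        v_col[j] = beta2 * v_col[j] + (1 - beta2) * sum(U[i][j] ** 2 for i in range(rows)) / rows
    mean_row = sum(v_row) / rows + eps
    out = [[U[i][j] / math.sqrt(v_row[i] * v_col[j] / mean_row + eps)
            for j in range(cols)] for i in range(rows)]
    # rescale so the overall RMS of the update is unchanged
    rms_in = math.sqrt(sum(x * x for r in U for x in r) / (rows * cols))
    rms_out = math.sqrt(sum(x * x for r in out for x in r) / (rows * cols)) + eps
    s = rms_in / rms_out
    return [[x * s for x in row] for row in out], v_row, v_col

U = [[1.0, 0.1], [0.2, 0.3]]
out, vr, vc = vr_step(U, [0.0, 0.0], [0.0, 0.0])
```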
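The MuonEq RC entries describe per-row-then-per-column normalization of the gradient matrix before the NS iterations (arxiv:2603.28254). A dependency-free sketch of that equilibration step (the function name is mine):

```python
import math

def muoneq_rc(G, eps=1e-8):
    """Row-column equilibration: divide each row by its L2 norm, then each
    column of the result by its L2 norm, flattening the spectrum before
    Newton-Schulz orthogonalization."""
    rows, cols = len(G), len(G[0])
    R = []
    for row in G:                                  # per-row normalization
        n = math.sqrt(sum(x * x for x in row)) + eps
        R.append([x / n for x in row])
    col_norms = [math.sqrt(sum(R[i][j] ** 2 for i in range(rows))) + eps
                 for j in range(cols)]             # per-col normalization
    return [[R[i][j] / col_norms[j] for j in range(cols)] for i in range(rows)]

E = muoneq_rc([[10.0, 0.1], [0.2, 0.05]])
col0 = math.sqrt(E[0][0] ** 2 + E[1][0] ** 2)      # columns end up ~unit norm
```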
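The GPTQ entries quantize to INT6 with clip_range=31 chosen by a "5-way clip sweep". Full GPTQ (Hessian calibration, actorder, error feedback) is beyond a sketch, but the symmetric quantize-and-sweep step can be illustrated like this (my simplification without the Hessian machinery; the toy weight vector is invented):

```python
def quantize_sym(ws, clip):
    """Symmetric uniform quantization to integer levels in [-clip, clip]."""
    scale = max(abs(w) for w in ws) / clip or 1.0
    q = [max(-clip, min(clip, round(w / scale))) for w in ws]
    deq = [qi * scale for qi in q]
    return q, deq, scale

def best_clip(ws, clips=(27, 28, 29, 30, 31)):
    """Sweep candidate clip levels; keep the one minimizing reconstruction MSE."""
    def mse(clip):
        _, deq, _ = quantize_sym(ws, clip)
        return sum((w - d) ** 2 for w, d in zip(ws, deq)) / len(ws)
    return min(clips, key=mse)

weights = [0.01 * i - 0.3 for i in range(61)]   # toy weight vector in [-0.3, 0.3]
q, deq, scale = quantize_sym(weights, 31)
clip = best_clip(weights)
```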
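The two P2 runs traced a regression to loss normalization: dividing the focal-weighted CE by w.sum() (a weighted mean) instead of taking a plain mean of w*CE. A toy comparison of the two normalizations (the token probabilities and CE values are invented):

```python
def p2_losses(probs, ces):
    """P2 focal weighting w = (1-p)^2 applied to per-token CE.
    Returns (buggy weighted mean, fixed plain mean of w*CE)."""
    ws = [(1.0 - p) ** 2 for p in probs]
    weighted = [w * ce for w, ce in zip(ws, ces)]
    buggy = sum(weighted) / sum(ws)        # weighted mean: rescales by 1/sum(w)
    fixed = sum(weighted) / len(weighted)  # plain mean: keeps the intended scale
    return buggy, fixed

# confident tokens (p near 1) get tiny focal weights
buggy, fixed = p2_losses([0.9, 0.8, 0.99], [0.105, 0.223, 0.010])
```

The point of the toy: when the weights are small, the two normalizations diverge sharply, so the w.sum() version silently changes the loss scale seen by the optimizer.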
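The coprime loader entries (v1 bug: stride=n/2 alternates between two offsets; v2: stride ≈ n/steps) sample shard offsets as i*stride mod n. Making the stride coprime with n guarantees every offset is eventually visited. A toy sketch (helper names are mine; the real run used stride=200001 on a ~100M-token shard):

```python
from math import gcd

def coprime_stride(n, total_steps):
    """Pick a stride near n // total_steps that is coprime with n, so the
    walk i -> (i * stride) % n cycles through all n offsets before repeating."""
    stride = max(1, n // total_steps)
    while gcd(stride, n) != 1:
        stride += 1
    return stride

def offsets(n, stride, steps):
    return [(i * stride) % n for i in range(steps)]

n = 1000
good = offsets(n, coprime_stride(n, 500), n)   # full coverage of the shard
bad = offsets(n, n // 2, 10)                   # the v1 bug: only 2 positions
```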
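Several entries reference the standard NS5 Muon orthogonalization with coefficients a=15/8, b=-5/4, c=3/8. One iteration applies p(X) = aX + b(XXᵀ)X + c(XXᵀ)²X after normalizing X to unit Frobenius norm; five iterations drive the singular values toward 1, yielding an approximately orthogonal update. A dependency-free sketch with plain nested lists standing in for the real tensor code:

```python
import math

def matmul(A, B):
    """Triple-loop matrix multiply (stands in for a BLAS/torch call)."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][t] * B[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

def transpose(A):
    return [list(row) for row in zip(*A)]

def ns5_orthogonalize(G, iters=5, a=15 / 8, b=-5 / 4, c=3 / 8):
    """Newton-Schulz iteration used by Muon: map a gradient matrix G to an
    approximately orthogonal matrix (its polar factor)."""
    fro = math.sqrt(sum(x * x for row in G for x in row)) or 1.0
    X = [[x / fro for x in row] for row in G]      # spectral norm now <= 1
    for _ in range(iters):
        A = matmul(X, transpose(X))                # A = X X^T
        AX = matmul(A, X)
        A2X = matmul(A, AX)                        # (X X^T)^2 X
        X = [[a * X[i][j] + b * AX[i][j] + c * A2X[i][j]
              for j in range(len(X[0]))] for i in range(len(X))]
    return X

X = ns5_orthogonalize([[2.0, 0.0], [1.0, 1.0]])
I_approx = matmul(X, transpose(X))                 # close to the identity
```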