Research pipeline

Fetched: 713
Graded: 656
Verified: 0
Injected: 149

Tier breakdown

Tier A: 22 · Tier B: 40 · Tier C: 536
Graded entries (each summary is followed by its score, tier, verification status, and graded date):
Current SOTA (1.1091 bpb). Introduces Turbo-Muon optimizer (4 NS iters + Polar Express), EngramLite (multi-head prime hash extending BigramHash), ParamBanking for parameter efficiency, and XSA on all 11 layers. The winning combination that beats all other entries. (Score 16.0 · Tier A · Verified: No · Graded 3/30/2026)
Top leaderboard submission at 1.0577 bpb. P2 loss ((1-p)^2 difficulty-aware weighting) is a novel and simple loss modification (~5 lines; sketch after the table). Wallclock-aware LR warmdown and conv token mixer are new directions not in our proven techniques. The gap of roughly 0.05 bpb below our 1.1091 bpb SOTA makes this the highest-priority item to study and adapt. (Score 15.0 · Tier A · Verified: No · Graded 3/31/2026)
Merged record achieving 1.1147 bpb (3-seed mean) within 15.91 MB and 600s on 8xH100. The key contribution is AR self-generated calibration data for GPTQ, which avoids validation data access during quantization — a novel and rules-compliant approach. Companion mechanistic interpretability analysis adds confidence in the method. (Score 15.0 · Tier A · Verified: No · Graded 3/30/2026)
Second-best score (1.1099 bpb) 'Rascal' entry. Uses XSA-all, Parallel Muon, Coprime-stride loader, BigramHash(2048), naive int6+zstd. Proves architecture+training quality can near-match SOTA without fancy quantization. (Score 15.0 · Tier A · Verified: No · Graded 3/30/2026)
45 experiments on 1×RTX 5090 testing XSA variants. Key finding: XSA on all 11 layers beats XSA on the last 4 layers by 0.0018 bpb. Also tests QK Gain 4.0 + LN Scale. Invaluable ablation study for XSA implementation. (Score 14.0 · Tier A · Verified: No · Graded 3/30/2026)
1.1140 bpb WITHOUT TTT. ResidLambdas (learned per-layer residual scaling), Split-LR, Train-Budget GPTQ (quantization within training budget), Coprime-stride loader. Most statistically rigorous (12-seed mean). Proves strong training alone suffices. (Score 14.0 · Tier A · Verified: No · Graded 3/30/2026)
1.1138 bpb via Fused MLP (Triton+CUTLASS EVT) + Brotli compression + Memmap loading. Systems-level optimization: fused kernels save ~1.8ms/step, enabling hundreds more training steps. Brotli achieves better compression than zstd. (Score 14.0 · Tier A · Verified: No · Graded 3/30/2026)
Current leaderboard SOTA at 1.0781 bpb via 30-epoch cosine TTT on the LeakyReLU² stack, validated across 3 seeds with tight std=0.0041. Already listed as top competitor — no novelty for our implementation, but strongly validates TTT epoch scaling as the highest-impact single lever available. (Score 13.0 · Tier A · Verified: No · Graded 3/31/2026)
Combines Full GPTQ XSA11 with a novel online n-gram eval-time ensemble (best_agree) that boosts the model distribution using prefix-only token/word experts. Achieves 1.1109 bpb. The eval-time augmentation technique is modular and implementable, but this is already a tracked competitor submission. (Score 13.0 · Tier A · Verified: No · Graded 3/31/2026)
Shared ValueEmbedding reuses tied tok_emb instead of training separate VE weights, freeing parameter budget to expand VE from 2 to 6 layers. Achieves 1.1201 bpb (3-seed mean, std 0.0002); not SOTA, but demonstrates a clean parameter-efficiency trade-off. The technique is well-documented with reproducible results, though it builds on already-proven components (LeakyReLU², Parameter Banking, TTT). (Score 13.0 · Tier A · Verified: No · Graded 3/31/2026)
1.1146 bpb (1-seed). EngramLite + Gated Skips + Full Hessian GPTQ + FlashAttention 3. Validates EngramLite works with gated skip connections. FA3 enables faster attention computation. (Score 13.0 · Tier A · Verified: No · Graded 3/30/2026)
1.1154 bpb. SLOT (Stochastic Logit Overlay at Test-time) + LeakyReLU² + Legal Score-First TTT + Parallel Muon. Introduces SLOT as a novel eval-time augmentation technique orthogonal to training improvements. (Score 13.0 · Tier A · Verified: No · Graded 3/30/2026)
PR #173 on parameter-golf achieves 1.1532 bpb combining NorMuon with Int6, MLP 3x, and Flash Attention 3. Directly validates NorMuon in the challenge context with measurable bpb. Most actionable item in this batch, as it provides a concrete parameter-golf submission with code. (Score 12.0 · Tier A · Verified: No · Graded 3/31/2026)
Introduces several novel techniques: sigmoid-gated U-Net skip connections, soft-round QAT (temperature-controlled rounding replacing STE), Brotli-11+byte-shuffle compression saving ~400KB, and code minification saving 78KB. Multiple compression innovations open new artifact budget headroom. Higher implementation complexity due to multiple interacting novelties. (Score 12.0 · Tier A · Verified: No · Graded 3/31/2026)
Strong 3-seed validated result at 1.1015 bpb with Split-LR (different LRs for early/late layers; sketch after the table). Most components (SLOT, GPTQ, XSA-all) are already proven, but Split-LR is a simple and novel hyperparameter technique. Tight on time budget (600s train + 177s eval = 777s total, near the limit). (Score 12.0 · Tier A · Verified: No · Graded 3/31/2026)
Directly targets the Muon optimizer already used in the challenge (Parallel Muon, Turbo-Muon). Proposes lightweight O(m+n) row/column equilibration before Newton-Schulz orthogonalization, improving spectral conditioning of the momentum matrix (sketch after the table). Could improve Muon convergence and thus bpb with negligible overhead and no impact on model size. Novel pre-orthogonalization direction not yet explored on the leaderboard. (Score 12.0 · Tier A · Verified: No · Graded 3/31/2026)
Current leaderboard leader at 1.08056 bpb via novel Scylla tokenizer and legal score-first TTT. Massive 3.47% improvement over merged SOTA. The custom tokenizer represents a fundamentally different approach but is complex to replicate and already tracked as the top competitor submission. (Score 12.0 · Tier A · Verified: No · Graded 3/31/2026)
Post-GPTQ quantization refinement via Antithetic Ternary Bin Search yields 1.1161 bpb (3-seed mean, std 0.0001) on top of Kitchen Sink V2. Modest improvement over merged SOTA (1.1194) with tight variance. The bin-search refinement over ternary quantization is a novel direction, though the overall result trails top competitors. (Score 12.0 · Tier A · Verified: No · Graded 3/31/2026)
Evidence-aware Dirichlet concentration for n-gram mixing: a one-line change that scales concentration inversely with log context count (sketch after the table). Claims +0.94% FineWeb improvement with proper normalization. Extremely easy to implement but the improvement is on the n-gram component only — full pipeline bpb delta may be small, and many n-gram cache tricks have been invalidated (see #677). (Score 12.0 · Tier A · Verified: No · Graded 3/31/2026)
1.1217 bpb. Adaptive Precision Embedding Quantization: variable bit-width per embedding dimension based on importance. Reduces quantization error on critical embedding dimensions. 4-seed mean. (Score 12.0 · Tier A · Verified: No · Graded 3/30/2026)
1.1187 bpb. 11L XSA4 + TrigramHash + ValueResidual + Legal TTT. ValueResidual is a technique where value projections get a direct residual path, improving gradient flow through attention layers (sketch after the table). (Score 12.0 · Tier A · Verified: No · Graded 3/30/2026)
1.1174 bpb. CROWN-Q (learned quantization grid) + GPTQ + Legal TTT. CROWN-Q learns optimal quantization boundaries per-layer, reducing quantization error vs fixed int6 grid. (Score 12.0 · Tier A · Verified: No · Graded 3/30/2026)
EmergentMind overview of NorMuon with explicit mathematical formulas for the per-neuron second-order momentum and row-wise normalization steps. Most implementation-ready source with the exact equations needed to modify the existing Muon optimizer. (Score 11.0 · Tier B · Verified: No · Graded 3/31/2026)
Full arXiv HTML of NorMuon paper with detailed algorithm description: Newton-Schulz orthogonalization followed by row-wise second-order momentum normalization (sketch after the table). Contains the exact update rules needed for implementation in train_gpt.py's Muon optimizer. (Score 11.0 · Tier B · Verified: No · Graded 3/31/2026)
Muon+ proposes an additional normalization step that claims to outperform NorMuon. It directly compares against NorMuon's second-moment scaling approach. Since Parallel Muon is already proven, this is an incremental variant with plausible but unquantified bpb gains at this training scale. (Score 11.0 · Tier B · Verified: No · Graded 3/31/2026)
Test-Time Training with MCTS-style tree search for optimal test-time compute allocation. TTT is the strongest validated technique in the competition (PR #1143 achieved 1.08056553 bpb, Δ-0.1438), but it is already on the leaderboard so the novelty score is low. The MCTS/UCB variant described here could offer incremental improvement over existing TTT implementations. (Score 11.0 · Tier B · Verified: No · Graded 3/31/2026)
Adds SLOT (Sample-specific LM Optimization at Test-time) as eval-time adaptation on top of existing legal TTT, achieving 1.11512 bpb. No training changes required, making it highly implementable. However, eval times of 568-572s leave almost no margin against the 600s cap, creating reliability risk. (Score 11.0 · Tier B · Verified: No · Graded 3/31/2026)
Reference implementation of BigramHash embedding from the parameter-golf records, already listed as a proven leaderboard technique (sketch after the table). The code is compact and well-documented, making it useful as an implementation reference, but offers zero novelty since BigramHash is already deployed by multiple competitors. (Score 11.0 · Tier B · Verified: No · Graded 3/31/2026)
Leaderboard snapshot revealing 1-bit quantization entry (1.1239 bpb, 106M params) on unlimited compute track. Goes beyond existing ternary/int6 quantization and achieves extreme compression, opening a new direction. However, 2hr training time and unlimited compute track mean time compatibility is uncertain for the 600s budget. (Score 11.0 · Tier B · Verified: No · Graded 3/31/2026)
MUD (MomentUm Decorrelation) claims 10-50% wall-clock improvement over tuned Muon by dramatically reducing per-step optimizer overhead while maintaining competitive convergence. Directly applicable to the 600s training budget — faster steps mean more training iterations or capacity for a larger model. Completely novel optimizer not on the leaderboard. (Score 11.0 · Tier B · Verified: No · Graded 3/30/2026)
SLOT eval-time augmentation gives -0.0008 bpb improvement. Orthogonal to training techniques. Free at inference time. Small but consistent gain. (Score 11.0 · Tier B · Verified: No · Graded 3/30/2026)
Original Cautious Weight Decay paper on OpenReview. Proposes modifying weight decay to only decay parameters whose gradient aligns with the decay direction (sketch after the table). Simple to implement (~5-10 lines) and used in modded-nanogpt upstream, but not yet validated in parameter-golf by any competitor. (Score 10.0 · Tier B · Verified: No · Graded 3/31/2026)
Polar Express replaces Muon's Newton-Schulz iterations with an optimized matrix sign method for computing polar decompositions. Primary benefit is computational efficiency, not direct bpb improvement — could free time budget for more training steps or larger models. Paper provides reference code but integration into the existing Muon implementation requires care. (Score 10.0 · Tier B · Verified: No · Graded 3/31/2026)
Meta-analysis mining 975 training runs reveals step-1000 BPB correlates 0.86 with final BPB, enabling cheap early-stopping heuristics for hyperparameter search (sketch after the table). Indirect benefit — does not itself reduce BPB but dramatically reduces the cost of exploring the hyperparameter space. Novel research direction for this competition that could be implemented as an early-stopping criterion. (Score 10.0 · Tier B · Verified: No · Graded 3/31/2026)
Novel 3-bit quantization using FWHT rotation to spread weight outliers before ternary coding. Could dramatically reduce artifact size (3-bit vs current int6), freeing budget for larger models or additional components. The rotation-domain smoothing before quantization is a new angle, though ternary quantization itself is already proven in the challenge. (Score 10.0 · Tier B · Verified: No · Graded 3/31/2026)
Directly relevant Parameter Golf technique: per-row MSE-optimal clip search for quantization, testing 5 clip values per weight row and keeping the one with lowest reconstruction error (sketch after the table). Claims Top-3 on leaderboard. Technique is a refinement of existing quantization approaches (Full GPTQ, int6 QAT already proven) but the per-row MSE search is a concrete, implementable improvement to weight compression. (Score 10.0 · Tier B · Verified: No · Graded 3/31/2026)
Novel Magma (Momentum-Aligned Gradient Masking) optimizer wrapper from Joo et al. 2026 paper, applied as a drop-in on NorMuon. Computes gradient-momentum cosine similarity and stochastically skips block updates, inducing curvature-dependent implicit regularization with zero additional memory overhead (sketch after the table). Highly implementable as it wraps the existing Muon optimizer already used in competition, and opens a genuinely new optimizer search direction not explored by any competitor. (Score 10.0 · Tier B · Verified: No · Graded 3/31/2026)
Direct competitor submission achieving 1.1116 bpb (3-seed mean) with Fused Triton MLP, Full Hessian GPTQ, Coprime-stride loader, XSA-all, and BigramHash 2816. All constituent techniques are already proven on the leaderboard, providing zero novelty, but the result validates this specific combination and the Triton MLP kernel saves 1.8ms/step. Custom Triton kernels add implementation complexity. (Score 10.0 · Tier B · Verified: No · Graded 3/31/2026)
Single-line replacement of ReLU² with LeakyReLU(0.5)² showing -0.0875 bpb improvement on 2×RTX 5090 (sketch after the table). Trivially implementable and already validated on the leaderboard as a proven technique. Provides additional confirmation on different hardware but offers zero novelty for advancing beyond current SOTA. (Score 10.0 · Tier B · Verified: No · Graded 3/31/2026)
GatedAttn + ValueResidual + Full QAT + lzma-9 + BigramHash(2048). The merged SOTA baseline that most top entries build on. ValueResidual technique originates here. (Score 10.0 · Tier B · Verified: No · Graded 3/30/2026)
1.1349 bpb. 11L U-Net + BigramHash + SmearGate + Partial RoPE + QAT. Solid baseline stack combining all proven techniques. U-Net architecture variant is interesting but doesn't beat simpler 11L. (Score 10.0 · Tier B · Verified: No · Graded 3/30/2026)
1.1311 bpb. 11L XSA4 + EMA + LoRA TTT + Partial RoPE + dim480. Tests LoRA-based TTT (only adapting low-rank projections at test time). Moderate improvement over non-TTT baseline. (Score 10.0 · Tier B · Verified: No · Graded 3/30/2026)
1.1194 bpb. Batched Muon + Full GPTQ with random calibration + extensive JEPA research. JEPA negative result: 20+ hours and 14 ablations proved JEPA hurts at 27M/600s scale. The GPTQ random calibration approach is useful. (Score 10.0 · Tier B · Verified: No · Graded 3/30/2026)
ChatPaper summary of NorMuon highlighting the key insight about neuron-level adaptive learning rates via row-wise normalization. Redundant with the arXiv and HuggingFace sources but provides a concise summary of the mechanism. (Score 9.0 · Tier B · Verified: No · Graded 3/31/2026)
Scylla (novel tokenizer) combined with Legal Score-First TTT achieves 1.0806 BPB (3-seed mean), beating current SOTA by ~0.029 BPB. Extremely strong result but already listed in competitor submissions, requires a custom tokenizer and test-time training logic that is complex to implement (likely >100 lines) and may strain size/time budgets. High competitor validation but low novelty for our purposes. (Score 9.0 · Tier B · Verified: No · Graded 3/31/2026)
TTT-Linear + FlowRefiner at 1.1347 BPB. TTT is strongly validated by the #1 competitor (PR #672, 1.0781 BPB), making it the most impactful known technique. TTT-Linear is a lightweight variant and FlowRefiner is a novel refinement mechanism worth investigating. The ~0.057 bpb gap to SOTA suggests implementation headroom remains. (Score 9.0 · Tier B · Verified: No · Graded 3/31/2026)
TTT-Linear with a novel 1-step flow matching refiner achieves 1.1347 val_bpb on 2×A100 but takes 2.2 hours — a hard time constraint violation. The FlowRefiner concept is genuinely novel and the bpb is in the competitive range, but adapting to 8×H100 within 600s is a major open question. (Score 9.0 · Tier B · Verified: No · Graded 3/31/2026)
Trivial env-knob-only submission setting WARMDOWN_ITERS=900. Val_bpb of 1.564 is catastrophically far from SOTA, indicating the submission lacks nearly all competitive optimizations (no XSA, no BigramHash, no TTT, no int6). Warmdown scheduling is already a known technique. (Score 9.0 · Tier B · Verified: No · Graded 3/31/2026)
SpotlightLFB (one-site late feature bank) is a novel architectural component with a co-designed export path using hybrid int6/int8+lzma compression. The 1.1493 bpb is not competitive with top entries but the technique opens a new direction. Fits time and size constraints comfortably. (Score 9.0 · Tier B · Verified: No · Graded 3/31/2026)
74.3M param Ternary U-Net Transformer v2 achieving 1.1539 bpb with BF16 scale storage and EMBED_DIM=312 improvements. Behind current SOTA by ~0.045 bpb. Ternary quantization is already a proven technique; the U-Net architecture direction has been explored but not competitively validated at top tier. (Score 9.0 · Tier B · Verified: No · Graded 3/31/2026)
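
Implementation sketches

The sketches below expand the entries marked "(sketch after the table)". They are minimal, hedged illustrations in PyTorch-style Python of how each technique could look in train_gpt.py; names, hyperparameters, and exact formulas are assumptions unless the entry itself states them.

P2 loss (1.0577 bpb entry): a minimal sketch of (1-p)^2 difficulty-aware weighting on the standard next-token cross-entropy. Whether the weight is detached and whether the mean is renormalized by the total weight are assumptions, not details from the submission.

```python
import torch
import torch.nn.functional as F

def p2_loss(logits, targets):
    """Cross-entropy with (1 - p)^2 difficulty-aware weighting (sketch)."""
    logp = F.log_softmax(logits, dim=-1)                       # (B, T, V)
    nll = -logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # (B, T)
    p = nll.neg().exp()                     # model probability of the correct token
    weight = (1.0 - p).pow(2).detach()      # assumption: weight treated as a constant
    return (weight * nll).mean()
```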
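
Split-LR (1.1015 bpb entry): different learning rates for early and late transformer blocks amount to two optimizer parameter groups. The split point, the LR values, and the `model.blocks` attribute name below are placeholders, not the submission's settings.

```python
import torch

def split_lr_param_groups(model, split=6, lr_early=0.02, lr_late=0.005):
    """Build two parameter groups: blocks [0, split) vs blocks [split, end)."""
    early, late = [], []
    for i, block in enumerate(model.blocks):   # assumption: layers live in model.blocks
        (early if i < split else late).extend(block.parameters())
    return [{"params": early, "lr": lr_early},
            {"params": late, "lr": lr_late}]

# optimizer = torch.optim.AdamW(split_lr_param_groups(model))
```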
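
Pre-orthogonalization equilibration (Muon-conditioning entry): a single O(m+n) pass of row/column scaling on the momentum matrix before the Newton-Schulz step. The max-abs norm and the single pass below are one plausible reading; the paper may iterate the scaling or use a different norm.

```python
import torch

def equilibrate(M, eps=1e-8):
    """One pass of row/column max-abs equilibration (sketch).

    Dividing each row and column by the square root of its max-abs value
    compresses the dynamic range of M, which tends to improve the spectral
    conditioning seen by the subsequent Newton-Schulz orthogonalization.
    """
    r = M.abs().amax(dim=1, keepdim=True).clamp_min(eps).sqrt()
    c = M.abs().amax(dim=0, keepdim=True).clamp_min(eps).sqrt()
    return M / (r * c)
```

Because Muon keeps only the orthogonal factor of the update, leaving the scaling in place rather than undoing it afterwards is a design choice that would need to be checked against the paper.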
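
Evidence-aware Dirichlet concentration (n-gram mixing entry): the entry describes a one-line change that shrinks the smoothing concentration as the log of the context count grows. The functional form below and the back-off to a base distribution are assumptions for illustration only.

```python
import math

def dirichlet_mix(counts, base_prob, context_count, alpha0=1.0):
    """Mix n-gram counts with a base distribution under an evidence-aware
    Dirichlet prior: well-evidenced contexts get a smaller concentration
    and therefore trust their own counts more (sketch)."""
    alpha = alpha0 / (1.0 + math.log1p(context_count))   # the assumed "one-line change"
    total = sum(counts.values()) + alpha
    vocab = set(counts) | set(base_prob)
    return {t: (counts.get(t, 0) + alpha * base_prob.get(t, 0.0)) / total
            for t in vocab}
```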
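
ValueResidual (1.1187 bpb entry): one common form of value-residual learning mixes each layer's value projection with the first layer's values before attention, giving values the direct residual path the entry describes. The fixed mixing coefficient below is an assumption; some variants learn it per layer.

```python
import torch
import torch.nn.functional as F

def attention_with_value_residual(q, k, v, v_first, lam=0.5):
    """Scaled dot-product attention where values get a direct residual
    path to the first layer's values (sketch)."""
    v_mixed = lam * v + (1.0 - lam) * v_first
    return F.scaled_dot_product_attention(q, k, v_mixed, is_causal=True)
```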
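
NorMuon (the two NorMuon source entries): the described update is Newton-Schulz orthogonalization of the momentum followed by a row-wise (per-neuron) second-moment normalization. The Newton-Schulz coefficients below are the ones commonly used with Muon; the exact NorMuon normalization and rescaling details should be verified against the paper before use.

```python
import torch

def newton_schulz_orth(G, steps=5, eps=1e-7):
    """Quintic Newton-Schulz iteration approximating the orthogonal (polar)
    factor of G, as used by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)
    transpose = X.size(0) > X.size(1)
    if transpose:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transpose else X

@torch.no_grad()
def normuon_update(momentum, second_moment, beta2=0.95, eps=1e-8):
    """Sketch of a NorMuon-style step: orthogonalize the momentum, then
    equalize per-neuron (row-wise) update magnitudes with a running
    second-moment estimate."""
    O = newton_schulz_orth(momentum)
    second_moment.mul_(beta2).add_(O.pow(2).mean(dim=1), alpha=1.0 - beta2)
    O = O / (second_moment.sqrt().unsqueeze(1) + eps)
    # Assumption: rescale so the overall update magnitude matches plain Muon's.
    return O * (O.shape[0] ** 0.5 / (O.norm() + eps))
```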
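
BigramHash embedding (reference-implementation entry): the general shape of the technique is a small hashed embedding table indexed by the (previous token, current token) pair and added to the usual token embedding. The 2048 buckets match the figures quoted elsewhere in the table; the specific multiply-by-a-prime hash below is an assumption, not the competition code.

```python
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    """Hashed bigram embedding added on top of the token embedding (sketch)."""
    def __init__(self, num_buckets=2048, dim=384, prime=1_000_003):
        super().__init__()
        self.emb = nn.Embedding(num_buckets, dim)
        self.num_buckets = num_buckets
        self.prime = prime

    def forward(self, idx):                      # idx: (batch, seq) token ids
        prev = torch.roll(idx, shifts=1, dims=1)
        prev[:, 0] = 0                           # no predecessor at position 0
        buckets = (prev * self.prime + idx) % self.num_buckets
        return self.emb(buckets)
```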
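
Cautious Weight Decay (OpenReview entry): the entry describes decaying only parameters whose gradient agrees with the decay direction. A minimal decoupled-decay version could look like the following; the exact masking condition should be confirmed against the paper.

```python
import torch

@torch.no_grad()
def cautious_weight_decay_(param, lr, weight_decay):
    """Apply decoupled weight decay only where the gradient and the decay
    direction (-param) point the same way, i.e. where grad * param > 0."""
    if param.grad is None:
        return
    mask = (param.grad * param) > 0
    param.sub_(lr * weight_decay * param * mask)
```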
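
Step-1000 correlation heuristic (975-run meta-analysis entry): with a 0.86 correlation between step-1000 and final BPB, a hyperparameter sweep can prune unpromising runs early. The margin below is a placeholder; the meta-analysis does not prescribe one.

```python
def should_prune(step, val_bpb, best_step1000_bpb, margin=0.02):
    """Abandon a run at step 1000 if its val BPB trails the best step-1000
    BPB seen so far by more than `margin` (sketch of an early-stop rule)."""
    return step == 1000 and val_bpb > best_step1000_bpb + margin
```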
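
Per-row MSE-optimal clip search (Parameter Golf quantization entry): for each weight row, try a handful of clip fractions of the row's max-abs value and keep the one with the lowest reconstruction error. The five-value clip grid and the symmetric int6 grid below are assumptions consistent with the entry's description.

```python
import torch

def quantize_rows_clip_search(W, n_bits=6, clip_grid=(1.0, 0.95, 0.9, 0.85, 0.8)):
    """Per-row symmetric quantization with an MSE-optimal clip (sketch).

    Returns the dequantized weight; a real exporter would also keep the
    integer codes and per-row scales."""
    qmax = 2 ** (n_bits - 1) - 1
    out = torch.empty_like(W)
    for i, row in enumerate(W):
        rmax = row.abs().max().clamp_min(1e-8)
        best_err, best_rec = None, None
        for clip in clip_grid:
            scale = clip * rmax / qmax
            rec = torch.clamp(torch.round(row / scale), -qmax - 1, qmax) * scale
            err = (rec - row).pow(2).sum()
            if best_err is None or err < best_err:
                best_err, best_rec = err, rec
        out[i] = best_rec
    return out
```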
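
Magma / momentum-aligned gradient masking (Joo et al. entry): the entry describes computing gradient-momentum cosine similarity per parameter block and stochastically skipping the block's update when they disagree. The sigmoid gate below is an assumed form of the skip probability, not the paper's exact rule.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def magma_keep_update(grad, momentum, temperature=1.0):
    """Return True if this block's update should be applied this step (sketch).

    The keep probability rises with the cosine similarity between the current
    gradient and the momentum buffer, so misaligned blocks are skipped more
    often; no extra per-parameter state is stored."""
    cos = F.cosine_similarity(grad.flatten(), momentum.flatten(), dim=0)
    keep_prob = torch.sigmoid(cos / temperature)   # assumption: sigmoid gating
    return bool(torch.rand(()) < keep_prob)
```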
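
LeakyReLU(0.5)² (single-line activation entry): the drop-in replacement for the ReLU² MLP activation; only the negative slope of 0.5 is given by the entry, the rest is a minimal rendering.

```python
import torch.nn.functional as F

def leaky_relu_sq(x):
    """LeakyReLU(negative_slope=0.5) squared: a one-line swap for the ReLU^2
    MLP activation that keeps a nonzero gradient for negative pre-activations."""
    return F.leaky_relu(x, negative_slope=0.5).square()
```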
Parameter Golf Dashboard · Last updated: 4/1/2026, 8:01:18 PM