
Parameter Golf Auto Research

27 Mar 2026 · 9 min read · 1,608 words

Let me be upfront: I am not an ML researcher. I'm a software engineer. My day job at Peloton mostly involves building web products and making sure things don't fall over at 3am. I know what a transformer is; I could not derive one from scratch. If you put me in a room with actual ML researchers to compete, it wouldn't be close.

So naturally, I decided to enter.

The Challenge, Briefly

Parameter Golf is deceptively simple. Train the best language model you can, but the entire artifact, code plus compressed weights, must fit in 16MB. Training is capped at 10 minutes on 8×H100 SXMs. No network calls during evaluation. The metric is bits per byte on FineWeb. Lower is better.

The current SOTA sits at 1.1194 bpb, achieved through int6 quantization, parameter banking, test-time training, and a custom bigram tokenizer, all stuffed inside that 16MB envelope. The leaderboard moves fast. Someone ships a new technique, the SOTA drops, and if you blink you've spent compute validating ideas that have already been done.

My first real attempt landed at 1.148 bpb. I was immediately outplayed. This is how the competition works.

The Honest Assessment

If I were competing purely on ML intuition, I'd lose. Full stop. The people at the top of this leaderboard understand the architecture decisions at a level I don't. They know which quantization schemes compose well, which attention variants are worth the parameter cost, and which training tricks are voodoo dressed up as science. I have none of that accumulated instinct.

What I do have is experience building agentic flows that are stable, reliable, and able to improve themselves over time. At first I was brute-forcing my solutions, which got expensive quickly. Thankfully RunPod & OpenAI awarded me $500 for my efforts so I could keep testing; that's when I took a step back and decided to completely rethink my approach, extending Karpathy's autoresearch paradigm.

The plan is not to out-research the researchers. The plan is to build something that out-researches them on my behalf.

Karpathy's Loop, With More Plumbing

Andrej Karpathy published autoresearch in early March. The idea is straightforward: give an agent a training script, let it modify the code, run a 5-minute experiment, check if the result improved, keep or discard, and repeat. About 12 experiments per hour on a single GPU. You wake up to a log of outcomes instead of a queue of things to run manually.

Karpathy's core loop:

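In rough pseudocode, this is how I read it, with the experiment runner and the code-editing steps passed in as stand-in callables (nothing below is Karpathy's actual implementation):

# Sketch of the autoresearch loop: propose a code edit, run a short
# experiment, keep the change only if the metric improved, repeat.
def autoresearch_loop(run_experiment, propose_edit, apply_edit, revert_edit, hours):
    history = []
    best_bpb = run_experiment(minutes=5)        # baseline from the unmodified script
    for _ in range(int(hours * 12)):            # ~12 five-minute experiments per hour
        edit = propose_edit(history)            # agent suggests a change to the training code
        apply_edit(edit)
        bpb = run_experiment(minutes=5)
        if bpb < best_bpb:
            best_bpb = bpb                      # improvement: keep the change
            history.append(("kept", edit, bpb))
        else:
            revert_edit(edit)                   # no improvement: discard it
            history.append(("discarded", edit, bpb))
    return best_bpb, history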

The problem is that autoresearch wasn't designed for Parameter Golf's specific constraints, and those constraints turn what looks like an ML optimisation problem into a systems engineering problem.

So I'm building my own: Parameter Golf Auto Research.

The Architecture: Two Agents, One Supervisor

The system runs as three processes. A thin process supervisor (orchestrate.py) spawns two independent Claude Code (Opus 4.6) agents, one for experiments, one for research, and manages the infrastructure lifecycle. Neither agent blocks the other. They communicate via shared JSONL files. The supervisor has no LLM logic of its own; it monitors agent health (restarting on crash, up to 5 attempts), and polls promotion_queue.jsonl for Tier 2 promotion requests.
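A minimal sketch of what orchestrate.py's supervision loop amounts to; the launch commands, file names, and polling interval here are illustrative stand-ins (in practice each agent command wraps a Claude Code session), not the real configuration:

import json, subprocess, time
from pathlib import Path

# Placeholder launch commands for the two agents.
AGENTS = {
    "experiment": ["python", "run_experiment_agent.py"],
    "research":   ["python", "run_research_agent.py"],
}
MAX_RESTARTS = 5
PROMOTION_QUEUE = Path("promotion_queue.jsonl")

procs = {name: subprocess.Popen(cmd) for name, cmd in AGENTS.items()}
restarts = {name: 0 for name in AGENTS}
seen = 0

while True:
    # Restart crashed agents, up to the retry limit; neither blocks the other.
    for name, proc in procs.items():
        if proc.poll() is not None and restarts[name] < MAX_RESTARTS:
            restarts[name] += 1
            procs[name] = subprocess.Popen(AGENTS[name])

    # Poll for Tier 2 promotion requests written by the experiment agent.
    if PROMOTION_QUEUE.exists():
        lines = PROMOTION_QUEUE.read_text().splitlines()
        for line in lines[seen:]:
            request = json.loads(line)
            print("promotion requested:", request)   # hand off to the RunPod launcher here
        seen = len(lines)

    time.sleep(10)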

Splitting the two concerns, running experiments and discovering new techniques, means the research pipeline can update its graded cache while an experiment is mid-run. It also means the research agent can respond reactively to experiment agent requests via research_queue.jsonl rather than only driving its own search cadence on a fixed timer.

The Three Constraints That Make This a Systems Problem

The 16MB artifact limit is the first one, and it changes everything. You can't optimise for loss in isolation; every architectural change has to be weighed against its compressed footprint. A technique that improves val_bpb by 2% but pushes the artifact over 16MB is worthless. That's not an ML judgement; it's a constraint satisfaction problem. Before any training run, the agent checks artifact size as a hard gate, not a warning: if the artifact is over the limit, the run exits with code 1, non-negotiable.
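The gate itself is tiny. A sketch, assuming the artifact is just the training script plus a weights file and that zstd is the compressor (the real pipeline may package things differently):

import sys
from pathlib import Path
import zstandard   # third-party: pip install zstandard

LIMIT_BYTES = 16 * 1024 * 1024

def compressed_artifact_size(paths):
    cctx = zstandard.ZstdCompressor(level=19)
    return sum(len(cctx.compress(Path(p).read_bytes())) for p in paths)

if __name__ == "__main__":
    size = compressed_artifact_size(["train.py", "weights.bin"])
    if size > LIMIT_BYTES:
        print(f"artifact ~{size / 1e6:.2f}MB exceeds the 16MB limit")
        sys.exit(1)   # hard gate: the run is refused, not warned about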

The second constraint is cost. Official runs on 8×H100 SXMs cost roughly $20/hour. A 10-minute training window is about $3.33 at minimum; pod startup and sync overhead push the real number closer to $3.50. You can't loop freely at that price, which means you need a cheap proxy to filter experiments before spending money on them. My answer is a two-tier compute model: MLX on Apple Silicon as the free scratchpad, RunPod 8×H100 as the expensive printer.


Checking feasibility before writing a single line of code is what prevents the most expensive mistakes.

$ python orchestrate.py check-constraints --params 23000000 --bits 6 --code-bytes 30000

Constraint Report
────────────────────────────────────────────────────────────
Parameters:      23,000,000
Bit-width:       6
Code bytes:      30,000

Artifact size:   ~17.25MB raw → ~13.8MB zstd   ✅ FEASIBLE (< 16MB)
Training steps:  ~18,400 steps at batch=64       ✅ OK within 600s
Quant MSE floor: 0.0031                          ✅ Acceptable
Entropy bound:   zstd can compress to ~13.4MB    ✅ FEASIBLE

Verdict: FEASIBLE, proceed to implementation
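The arithmetic behind that report is not deep. A simplified version, where the zstd compression ratio is my assumption rather than a measured figure:

PARAMS     = 23_000_000
BITS       = 6
CODE_BYTES = 30_000
LIMIT_MB   = 16.0
ZSTD_RATIO = 0.8     # assumed compression ratio for quantized weights

raw_mb  = (PARAMS * BITS / 8 + CODE_BYTES) / 1e6   # ~17.28MB raw
zstd_mb = raw_mb * ZSTD_RATIO                      # ~13.8MB after compression

print(f"~{raw_mb:.2f}MB raw -> ~{zstd_mb:.1f}MB zstd:",
      "FEASIBLE" if zstd_mb < LIMIT_MB else "NOT FEASIBLE")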

If the report says NOT FEASIBLE, the idea is mathematically doomed and the agent redesigns before a training run is wasted. If it passes, the idea goes to the free tier first: five hundred local iterations cost nothing and take a few minutes. If the local val_bpb doesn't drop by at least a dynamic threshold (3% by default, relative to the running baseline), the experiment doesn't get promoted. The threshold adapts as we go: the closer our score gets to SOTA, the smaller the margin of improvement we accept.
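A sketch of the promotion rule; the interpolation between the 3% default and a smaller floor is my own guess at how the schedule behaves:

SOTA_BPB       = 1.1194
START_BPB      = 1.30     # assumed starting baseline
BASE_THRESHOLD = 0.03     # 3% relative improvement required early on
MIN_THRESHOLD  = 0.005    # accept smaller margins as we close in on SOTA

def promotion_threshold(baseline_bpb):
    progress = (START_BPB - baseline_bpb) / (START_BPB - SOTA_BPB)
    progress = max(0.0, min(1.0, progress))
    return BASE_THRESHOLD - (BASE_THRESHOLD - MIN_THRESHOLD) * progress

def should_promote(baseline_bpb, local_bpb):
    relative_drop = (baseline_bpb - local_bpb) / baseline_bpb
    return relative_drop >= promotion_threshold(baseline_bpb)

print(should_promote(1.148, 1.135))   # a modest drop near SOTA can still promote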

Every hypothesis also passes through five deterministic hard gates before any code is written or compute is spent: a constraint check (artifact size, training steps, quantization MSE, entropy bounds, memory footprint), a contamination check (AST analysis for validation data leakage), a critic gate (diff size, similarity to past failures), a dynamic promotion threshold, and a budget check. The agent cannot override any of these.
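Stitched together, the gate chain looks roughly like this; the individual checks are stubbed out and the field names and thresholds are illustrative:

def run_hard_gates(hypothesis, gates):
    # Deterministic, ordered, and not overridable by the agent:
    # the first failing gate rejects the hypothesis outright.
    for name, gate in gates:
        if not gate(hypothesis):
            return False, f"rejected by {name} gate"
    return True, "all gates passed"

GATES = [
    ("constraint",    lambda h: h["est_artifact_mb"] < 16 and h["est_train_seconds"] < 600),
    ("contamination", lambda h: not h["touches_validation_data"]),   # AST analysis in practice
    ("critic",        lambda h: h["diff_lines"] < 200 and not h["similar_to_past_failure"]),
    ("threshold",     lambda h: h["expected_gain"] >= h["dynamic_threshold"]),
    ("budget",        lambda h: h["est_cost_usd"] <= h["budget_remaining_usd"]),
]

ok, reason = run_hard_gates({
    "est_artifact_mb": 13.8, "est_train_seconds": 540,
    "touches_validation_data": False,
    "diff_lines": 80, "similar_to_past_failure": False,
    "expected_gain": 0.03, "dynamic_threshold": 0.03,
    "est_cost_usd": 3.50, "budget_remaining_usd": 100.0,
}, GATES)
print(ok, reason)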

The third constraint is that the competition is a moving target. An agent that only sees its own code will miss techniques that other competitors publish mid-challenge. The leaderboard is public. The PRs are public. OpenAI explicitly encourages building on top of each other's ideas. If someone ships a new quantization scheme that drops bpb by 0.05, I want to know before I spend three runs rediscovering it. So the research pipeline watches openai/parameter-golf PRs, KellerJordan/modded-nanogpt, ArXiv, OpenReview, Semantic Scholar, Tavily web search, and several other sources, ten in total, in parallel, with the research agent driving its own cadence and responding reactively to experiment requests.

The Part That Actually Interests Me

The research grading is where the software engineering instinct matters most. The naive approach is to score papers by citation count or abstract quality. That's useless here. A paper about a technique that requires a new pip dependency, or that would push the artifact over 16MB, or that needs 20 minutes of training time, scores zero regardless of how good the underlying idea is.

There's also a lightweight pre-filter before any LLM tokens are spent: the system tries to extract parameter counts and bit-widths from titles and abstracts using regex. If both are extractable and the numbers are infeasible, the item is auto-rejected. Only items that survive that pass get scored by the model.
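A sketch of that pre-filter; the patterns and the rejection margin are mine and deliberately crude, since anything ambiguous just falls through to the LLM grader:

import re

PARAM_RE = re.compile(r"(\d+(?:\.\d+)?)\s*([MB])\s+param", re.IGNORECASE)
BITS_RE  = re.compile(r"(\d+)\s*-?\s*bit", re.IGNORECASE)

def prefilter(text, limit_mb=16.0):
    params = PARAM_RE.search(text)
    bits   = BITS_RE.search(text)
    if not (params and bits):
        return "needs_grading"   # can't decide cheaply: spend LLM tokens on it
    count  = float(params.group(1)) * (1e9 if params.group(2).upper() == "B" else 1e6)
    raw_mb = count * int(bits.group(1)) / 8 / 1e6
    # Generous margin: only reject what can't possibly fit, even after zstd.
    return "auto_reject" if raw_mb > limit_mb * 1.5 else "needs_grading"

print(prefilter("Quantizing a 7B parameter model to 4-bit"))         # auto_reject
print(prefilter("6-bit training of 20M parameter language models"))  # needs_grading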

The grading prompt is built dynamically from experiment history: what's the current SOTA, what techniques are already proven, what has already failed. A paper about an approach that failed two weeks ago gets penalised. A confirmed improvement adds its technique to the proven list, which penalises re-implementation.
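A sketch of the prompt assembly; the wording, fields, and the example state below are illustrative placeholders, not my real prompt or results:

def build_grading_prompt(state, item):
    return "\n".join([
        f"Current best: {state['best_bpb']} bpb. Competition SOTA: {state['sota_bpb']} bpb.",
        f"Already proven here (penalise re-implementation): {', '.join(state['proven'])}",
        f"Already failed here (penalise similar ideas): {', '.join(state['failed'])}",
        "Constraints: 16MB compressed artifact, 10 minutes on 8xH100, no new pip dependencies.",
        "Score 0-10 for expected bpb improvement under those constraints.",
        f"Title: {item['title']}",
        f"Abstract: {item['abstract']}",
    ])

state = {
    "best_bpb": 1.148, "sota_bpb": 1.1194,
    "proven": ["<technique confirmed by a promoted experiment>"],
    "failed": ["<approach rejected in a past experiment>"],
}
print(build_grading_prompt(state, {"title": "...", "abstract": "..."}))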

After grading and verification, a reflection cycle synthesises experiment history into strategic guidance: it identifies failure patterns, distinguishes exhausted from promising search dimensions, and recommends next experiments. The output goes into strategy.md and gets injected into the agent's working context alongside raw experiment data. In parallel, the system maintains a technique adjacency map (technique_map.json), a graph of technique relationships labelled as proven, exploring, dead end, or untried. The agent gets a structured view of where the search has already been, not just a flat list.
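The map itself is small. A sketch of the shape I have in mind for technique_map.json, with placeholder entries purely for illustration rather than my actual map:

import json
from pathlib import Path

# Placeholder graph: each technique has a status and its neighbours in the search space.
technique_map = {
    "int6_quantization":  {"status": "proven",    "adjacent": ["int5_quantization", "parameter_banking"]},
    "parameter_banking":  {"status": "exploring", "adjacent": ["weight_tying"]},
    "int4_quantization":  {"status": "dead_end",  "adjacent": []},
    "test_time_training": {"status": "untried",   "adjacent": ["bigram_tokenizer"]},
}
Path("technique_map.json").write_text(json.dumps(technique_map, indent=2))

# The structured view the agent sees, grouped by status rather than a flat list.
graph = json.loads(Path("technique_map.json").read_text())
for status in ("proven", "exploring", "dead_end", "untried"):
    names = [t for t, v in graph.items() if v["status"] == status]
    print(f"{status:9}: {', '.join(names) or '-'}")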

Before Tier A items proceed to expensive verification, the research agent can also run a micro-experiment: apply a diff to a temporary copy of the training script, run 50 iterations on synthetic data (~15 seconds on an M4 MacBook Air), and check whether the code crashes, loss decreases, and artifact size is still legal. It won't catch everything, but it kills syntax errors, import failures, and NaN divergence before they waste a full experiment slot.
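A sketch of the micro-experiment harness; the patch tooling, the training script's CLI flags, and its loss-per-line output format are all assumptions:

import math, shutil, subprocess, tempfile
from pathlib import Path

LIMIT_BYTES = 16 * 1024 * 1024

def micro_experiment(diff_path, iters=50):
    workdir = Path(tempfile.mkdtemp())
    shutil.copy("train.py", workdir / "train.py")

    # Apply the candidate diff to the scratch copy only.
    diff = str(Path(diff_path).resolve())
    if subprocess.run(["patch", "-p0", "-d", str(workdir), "-i", diff]).returncode != 0:
        return False                                  # diff doesn't even apply

    # ~15 seconds of synthetic data on an M4 MacBook Air.
    run = subprocess.run(
        ["python", "train.py", "--iters", str(iters), "--synthetic"],
        cwd=workdir, capture_output=True, text=True, timeout=120)
    if run.returncode != 0:
        return False                                  # syntax error, import failure, crash

    losses = []
    for tok in run.stdout.split():                    # assume the script prints a loss per step
        try:
            losses.append(float(tok))
        except ValueError:
            pass
    artifact = workdir / "weights.bin"
    return (len(losses) >= 2 and math.isfinite(losses[-1])   # no NaN divergence
            and losses[-1] < losses[0]                       # loss actually decreased
            and artifact.exists()
            and artifact.stat().st_size < LIMIT_BYTES)       # artifact still legal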

There's also a tournament mode: instead of betting on the first plausible hypothesis, generate four candidate modifications, run each for 100 iterations in an elimination round, advance the top two to a full 500-iteration run, and report the winner. When the search space feels wide, it's more efficient than sequential testing.
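The tournament is just a two-round sort; local_run here stands in for whatever executes a short MLX training job and returns val_bpb:

def tournament(candidates, local_run):
    # Round 1: 100-iteration elimination; the two lowest-bpb candidates advance.
    finalists = sorted(candidates, key=lambda c: local_run(c, iters=100))[:2]

    # Round 2: a full 500-iteration run for the survivors decides the winner.
    scored = [(local_run(c, iters=500), c) for c in finalists]
    best_bpb, winner = min(scored, key=lambda t: t[0])
    return winner, best_bpb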

This is the part of the problem that feels familiar. It's just a feedback loop with a well-defined metric, which is the same thing I'd build for any system I wanted to improve without watching it constantly.

Follow Along

I'm publishing a live dashboard at robinw.co.uk/pgolf where you can follow the agents' progress in real time: best bpb, distance to SOTA, artifact headroom, budget remaining, recent experiments, and the research pipeline's latest findings. It'll be empty until the first run kicks off, but it updates as the system runs.

What I Don't Know

I've never trained a language model that anyone cared about. I'm genuinely uncertain how much the MLX-to-PyTorch translation gap will hurt me: architecture changes that work locally don't always transfer cleanly to multi-GPU at scale, and diagnosing why is expensive when each diagnostic run costs $3.50.

Why Do This at All

Partly because the problem is genuinely interesting: it's one of the few public benchmarks where the constraint set is specific enough to make engineering choices matter as much as model knowledge. Partly because I'm curious whether a systems-first approach can actually compete with ML-first intuition on a problem this constrained. And partly for fun: there's nothing like a bit of competition to get creative and come up with new ways to solve problems.

I'm kicking off the first run Monday the 30th of March. Hope to see you all on the leaderboard ✌️.