robin · Senior Software Engineer, Applied AI

Parameter Golf Auto Research

27 Mar 2026 · 11 min read · 2,195 words

OpenAI's Parameter Golf challenge is deceptively simple to describe: train the best language model you can, but the entire artifact (code plus compressed weights) must fit in 16MB, training is capped at 10 minutes on 8×H100 SXMs, and the model can't make any network calls during evaluation. The metric is bits per byte on FineWeb. Lower is better.
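For orientation, bits per byte is the model's summed cross-entropy converted from nats to bits and normalized by the raw byte count of the text. A sketch of the standard definition (not the challenge's exact evaluation harness):

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    # Cross-entropy summed over the eval set in nats, converted to bits
    # (divide by ln 2), then normalized per byte of raw UTF-8 text.
    return total_nll_nats / (math.log(2) * total_bytes)
```

A model that needs exactly one bit of surprise per byte of text scores 1.0; the challenge SOTA of 1.1194 means just over a bit per byte.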

The current SOTA sits at 1.1194 bpb. Getting there required int6 quantization, parameter banking, test-time training, and a custom bigram tokenizer, all packed inside that 16MB envelope. The leaderboard moves fast. Someone ships a new technique, the SOTA drops, and if you're not watching the repo you'll spend compute validating ideas that are already on the board.

I wanted to pull in my own experience in retrieval, agentic loops, and context management to build something autonomous that would self-improve over time. Not "run a script overnight" autonomous, but actually autonomous: an agent that can gather research, form hypotheses, run experiments, and decide what to promote to expensive hardware without me in the loop. Karpathy's autoresearch showed the core loop works: modify a training script, train, evaluate, keep or revert, repeat, at about 12 experiments per hour on a single GPU. But Parameter Golf adds constraints that autoresearch wasn't designed for, and those constraints turn what looks like an ML problem into a systems problem.

Introducing Parameter Golf Auto Research 🏌️‍♂️

The Constraints That Make This Hard

The 16MB artifact limit is the first constraint that changes everything. You can't just optimize for loss: every architectural change has to be weighed against its compressed footprint. A technique that improves val_bpb by 2% but pushes the artifact over 16MB is worthless. So before any training run, the agent checks:

python measure_artifact.py

That script measures train_gpt.py code bytes plus zstd-compressed weights, and exits with code 1 if the total exceeds 16,000,000 bytes. It's a hard gate, not a warning.

from pathlib import Path

ARTIFACT_LIMIT = 16_000_000  # hard cap in bytes

def measure_artifact(train_script: str = "train_gpt.py") -> int:
    script_path = Path(train_script)
    total = script_path.stat().st_size

    # Weights count at their zstd-compressed size, not their on-disk size.
    weight_files = list(Path(".").glob("*.npz")) + list(Path(".").glob("*.pt"))
    if weight_files:
        zstd_module = _try_import_zstandard()
        for weight_path in weight_files:
            total += _measure_weight_file(weight_path, zstd_module)

    headroom = ARTIFACT_LIMIT - total
    status = "OK" if total <= ARTIFACT_LIMIT else "OVER LIMIT"
    print(f"artifact_bytes: {total}")
    print(f"headroom: {headroom}")
    print(f"status: {status}")
    return total

if __name__ == "__main__":
    # The hard gate: non-zero exit code if the artifact exceeds the limit.
    raise SystemExit(0 if measure_artifact() <= ARTIFACT_LIMIT else 1)

The second constraint is cost. The official runs require 8×H100 SXMs at roughly $20/hour. A 10-minute training window costs about $3.33 at minimum; pod startup and rsync overhead bring the real number closer to $3.50. You can't loop freely at that price. You need a way to filter experiments before spending money on them.

The third constraint is that the competition is a moving target. An agent that only sees its own code will miss techniques that other competitors publish mid-challenge. The leaderboard is public. The PRs are public. If someone ships a new quantization scheme that drops bpb by 0.05, you want to know about it before you spend three runs rediscovering it.

These three constraints (artifact size, compute cost, and competitive intelligence) are what this project is actually about. The model architecture is almost secondary.

Two Tiers of Compute

The solution to the cost problem is a two-tier compute model. MLX on Apple Silicon is the free scratchpad. RunPod 8×H100 is the expensive printer.


Every experiment starts locally. The agent runs 500 iterations of train_gpt_mlx.py on Apple Silicon: fast, free, directional. The local val_bpb isn't the challenge score; it's a signal. If it drops at least 3% relative to the running local baseline, the commit qualifies for promotion to RunPod. Below that threshold, the experiment is either kept (if it simplified the code) or discarded.
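The promotion rule above reduces to a one-line gate; a sketch (function name hypothetical):

```python
def qualifies_for_promotion(local_bpb: float, baseline_bpb: float,
                            min_rel_improvement: float = 0.03) -> bool:
    # Promote only when the local run beats the running local baseline
    # by at least 3% relative: the gate for spending RunPod money.
    return (baseline_bpb - local_bpb) / baseline_bpb >= min_rel_improvement
```

With a local baseline of 1.20 bpb, a run at 1.15 clears the gate (about 4.2% relative) while a run at 1.18 does not (about 1.7%).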

The 3% threshold is a gating mechanism, not a magic number. It's calibrated to filter out noise while catching real improvements. As the system accumulates paired local/RunPod results, get_tier_correlation() tracks the average delta between the two tiers: how much the local signal over- or under-predicts the actual RunPod result. That correlation gets injected back into program.md so the agent can adjust its intuitions about what local improvements are worth promoting.

The budget math is straightforward. With TOTAL_COMPUTE_CREDITS=500 and RUNPOD_MIN_RESERVE=50, the $450 of spendable budget buys roughly 128 Tier 2 runs at ~$3.50 each before hitting the reserve floor. The one-run-per-hour rate limit means a continuous session can't burn through that in less than five days. Both constraints persist across process restarts in budget.json.

def can_submit(self) -> tuple[bool, str]:
    remaining = self.total_credits - self.spent
    if remaining < self.min_reserve:
        return (
            False,
            f"Blocked: remaining ${remaining:.2f} is below reserve ${self.min_reserve:.2f}",
        )

    if self._is_rate_limited():
        return False, "Rate limited: less than 1 hour since last Tier 2 run"

    warning_threshold = self.min_reserve * _RESERVE_WARNING_MULTIPLIER
    if remaining < warning_threshold:
        return (
            True,
            f"Warning: remaining ${remaining:.2f} is below 2x reserve ${warning_threshold:.2f}",
        )

    return True, "OK"

The cost calculation uses actual wall-clock duration: (seconds / 3600) * gpu_count * (hourly_rate / 8). This means a run that finishes in 8 minutes costs less than one that hits the 10-minute cap, and the budget reflects that accurately.
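As a sketch (function name hypothetical), that formula is:

```python
def run_cost_usd(seconds: float, gpu_count: int = 8, hourly_rate: float = 20.0) -> float:
    # hourly_rate prices the full 8-GPU pod, so hourly_rate / 8 is the
    # per-GPU-hour price; cost scales with actual wall-clock duration.
    return (seconds / 3600) * gpu_count * (hourly_rate / 8)
```

A run that hits the full 10-minute cap costs run_cost_usd(600), about $3.33; one that finishes in 8 minutes costs about $2.67.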

Pod Lifecycle Safety

At $20/hour for 8×H100s, a hung pod costs $0.33/minute to forget about. The orchestrator can crash. The SSH connection can drop. The Python process can get killed. Any of these leaves an active pod running and billing.

The fix is two lines in RunPodClient.__init__:

atexit.register(self._cleanup_all)
signal.signal(signal.SIGTERM, lambda signum, frame: self._cleanup_all())

_cleanup_all iterates over self._active_pods and terminates each one. It runs on normal exit, on SIGTERM, and on any unhandled exception that propagates to the top level. The set of active pods is maintained in memory; if the process is killed with SIGKILL there's no recovery, but that's an acceptable edge case. The common failure modes (crash, keyboard interrupt, deployment restart) are all covered.
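A minimal sketch of that lifecycle guard; only the two registration lines come from the real code, and the client internals here are hypothetical:

```python
import atexit
import signal

class RunPodClient:
    def __init__(self) -> None:
        self._active_pods: set[str] = set()
        # The two registration lines from the article: cleanup runs on
        # normal exit and on SIGTERM.
        atexit.register(self._cleanup_all)
        signal.signal(signal.SIGTERM, lambda signum, frame: self._cleanup_all())

    def _terminate(self, pod_id: str) -> None:
        # Hypothetical stand-in for the real RunPod termination API call.
        print(f"terminating {pod_id}")

    def _cleanup_all(self) -> None:
        # Iterate over a copy so termination can mutate the set safely.
        for pod_id in list(self._active_pods):
            self._terminate(pod_id)
            self._active_pods.discard(pod_id)
```

Because _cleanup_all is idempotent over an empty set, double-firing (atexit after a SIGTERM handler already ran) is harmless.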

This is the kind of thing that's obvious in retrospect and expensive to learn the hard way.

The Research Pipeline

The research loop runs every 6 hours. Seven sources get queried in parallel via asyncio.gather: ArXiv, OpenReview, Semantic Scholar, GitHub PRs (watching openai/parameter-golf, KellerJordan/modded-nanogpt, and karpathy/autoresearch), RSS feeds, three Tavily scheduled search modes, and GitHub code search. A separate daemon thread hits Tavily's breaking news endpoint every hour, which catches competition-specific preprints and blog posts that wouldn't show up in a standard batch crawl.

results = await asyncio.gather(
    fetch_arxiv(since_hours),
    fetch_openreview(since_hours),
    fetch_semantic_scholar(since_hours),
    fetch_github_prs(since_hours),
    fetch_feeds(since_hours),
    fetch_tavily_scheduled(),
    fetch_github_code_search(),
    return_exceptions=True,
)

The return_exceptions=True matters. If Semantic Scholar is down or the GitHub rate limit kicks in, the other sources still complete. Failed sources log their errors and the pipeline continues.

After deduplication and a Tavily relevance threshold filter (items scoring below 0.4 get dropped), new items go to raw_cache.jsonl. Then they get graded.
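A sketch of that filter stage, with item field names (url, score) assumed:

```python
def dedupe_and_filter(items: list[dict], seen_urls: set[str],
                      min_score: float = 0.4) -> list[dict]:
    kept = []
    for item in items:
        url = item.get("url")
        if url in seen_urls:
            continue  # already cached in raw_cache.jsonl
        if item.get("score", 1.0) < min_score:
            continue  # below the Tavily relevance threshold
        seen_urls.add(url)
        kept.append(item)
    return kept
```

Items from non-Tavily sources may carry no relevance score; the sketch lets those through by defaulting to 1.0.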

Grading Research Against the Challenge's Actual Constraints

The grading step is where most research pipelines stop being interesting. The naive approach is to score papers by citation count or abstract quality. That's useless here. A paper about a technique that requires a new pip dependency, or that would push the artifact over 16MB, or that needs 20 minutes of training time, scores zero regardless of how good the underlying idea is.

The grader uses five dimensions: bpb_impact (0-3), size_compatibility (0-3), time_compatibility (0-2), implementability (0-4), and novelty (0-3). Total out of 15. Items scoring 10+ are Tier A; 7-9 are Tier B; below 7 are Tier C.
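The tier cutoffs are mechanical; a sketch (function name hypothetical, dimension names from the rubric above):

```python
def grade_tier(scores: dict[str, int]) -> str:
    # bpb_impact 0-3, size_compatibility 0-3, time_compatibility 0-2,
    # implementability 0-4, novelty 0-3: total out of 15.
    total = sum(scores.values())
    if total >= 10:
        return "A"
    if total >= 7:
        return "B"
    return "C"
```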

What makes this interesting is that the grading prompt is built dynamically from experiment history:

def _build_grading_prompt() -> str:
    """Build the grading prompt dynamically from experiment history."""
    current_best = get_current_best_bpb()
    proven = get_proven_techniques()
    failed = get_failed_experiments()
    competitors = get_competitor_scores()

    proven_text = ", ".join(proven)

    if failed:
        failed_lines = []
        for exp in failed[-_MAX_FAILED_IN_PROMPT:]:
            failed_lines.append(
                f"- {exp['description']} (val_bpb={exp['val_bpb']}, tier={exp['tier']})"
            )
        failed_section = (
            "\n## FAILED EXPERIMENTS (penalize re-attempts of these approaches):\n"
            + "\n".join(failed_lines)
            + "\n"
        )
    else:
        failed_section = ""

    has_competitors = bool(competitors)
    if has_competitors:
        comp_lines = []
        for comp in competitors[:_MAX_COMPETITORS_IN_PROMPT]:
            comp_lines.append(
                f"- PR #{comp['pr_number']} by @{comp['author']}: "
                f"{comp['technique']}: {comp['val_bpb']} bpb "
                f"(Δ{comp['delta_from_baseline']:+.4f})"
            )
        competitor_section = (
            "\n## COMPETITOR SCORES (what others achieved and with what techniques):\n"
            + "\n".join(comp_lines)
            + "\n"
        )
    else:
        competitor_section = ""
    ...

The SOTA reference updates automatically from results.tsv: the lowest val_bpb from tier=runpod, status=keep rows. Proven techniques merge the static baseline list with any technique from a promoted RunPod experiment. Failed experiments get injected to penalize re-attempts. Competitor scores add a sixth dimension (competitor_validated, 0-2) when available.

This means the grading prompt on day 10 of running the system is substantially different from the grading prompt on day 1. It knows what you've tried, what failed, what competitors have shipped, and what the current bar is. A paper about BigramHash embeddings that would have scored 12/15 on day 1 scores much lower once BigramHash is already in the proven techniques list.

Competitor Intelligence

OpenAI openly encourages building on top of each other's ideas; early in the competition I submitted a val_bpb of 1.148 only to be immediately outplayed. That's what the competition is for: constant, aggressive improvement. The GitHub PR source does something specific for openai/parameter-golf: it fetches the records/*/README.md files from each PR, which is where competitors document their val_bpb scores and techniques. Then it runs regex extraction to pull the actual numbers:

_RE_VAL_BPB_BOLD = re.compile(r"\*{0,2}val_bpb:\s*(\d+\.\d+)\*{0,2}", re.IGNORECASE)
_RE_VAL_BPB_EQ = re.compile(r"val_bpb\s*[=:]\s*(\d+\.\d+)", re.IGNORECASE)
_RE_BPB_BARE = re.compile(r"(?

The extracted scores feed the competitor table that inject.py maintains inside program.md, which is organized into four auto-maintained sections:

## Research Context

[Auto-injected by inject.py; do not manually edit this section]


## Experiment History

[Auto-injected by inject.py; do not manually edit this section]


## Competitor Scores

[Auto-injected by inject.py; do not manually edit this section]


## Verified Research (deep-analyzed)

[Auto-injected by verify.py; Tier A items re-evaluated with full content + web verification]

Every 6 hours, inject_into_program_md() rewrites all four sections. The research context gets the top 12 graded items by score, with a budget of 12,000 characters (roughly 3,000 tokens at four characters each) to prevent context overflow. The experiment history gets the last 8 runs from results.tsv with their val_bpb, status, and cost. The competitor scores render as a markdown table. The verified section shows the top 5 verified items with their score changes and implementation briefs.
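A sketch of how the character-budgeted rendering might work (function and field names assumed):

```python
RESEARCH_CHAR_BUDGET = 12_000  # ~3,000 tokens at ~4 characters per token

def render_research_section(items: list[dict], top_n: int = 12) -> str:
    ranked = sorted(items, key=lambda i: i["score"], reverse=True)[:top_n]
    lines: list[str] = []
    used = 0
    for item in ranked:
        line = f"- [{item['score']}/15] {item['title']}: {item['summary']}"
        if used + len(line) + 1 > RESEARCH_CHAR_BUDGET:
            break  # the hard character cap wins over the item count
        lines.append(line)
        used += len(line) + 1  # +1 for the joining newline
    return "\n".join(lines)
```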

The injection uses regex replacement against HTML comment markers (the marker names shown here are illustrative):

new_content = re.sub(
    r"<!-- RESEARCH_START -->.*?<!-- RESEARCH_END -->",  # marker names illustrative
    replacement,
    content,
    flags=re.DOTALL,
)

There's also a dynamic baseline update that rewrites the SOTA line in the Metric section:

new_content = re.sub(
    r"\*\*SOTA: [\d.]+ bpb\. Baseline: 1\.2244 bpb\.\*\*",
    f"**SOTA: {current_best} bpb. Baseline: 1.2244 bpb.**",
    content,
)

The agent always sees the current SOTA, not a stale number from when the file was first written.

The Feedback Loop

The most important architectural decision was closing the loop between experiment outcomes and research selection. The initial version was open-loop: research got fetched, graded, and injected, but the grading prompt never knew what the agent had actually tried. Papers about techniques that had already failed would keep scoring well and keep getting injected.

The closed loop works like this:


results.tsv is the audit trail and the feedback signal. Every row has a commit hash, tier (local or runpod), val_bpb, artifact_bytes, status (keep/discard/crash), and cost. get_current_best_bpb() reads the minimum val_bpb from tier=runpod, status=keep rows. get_proven_techniques() extracts technique names from promoted RunPod experiments and merges them with the static baseline list. get_failed_experiments() returns the last 10 discarded or crashed experiments.

All three feed into _build_grading_prompt() on every grading cycle. The grading prompt is never static. It reflects the actual state of the experiment history.

This matters more than it sounds. Most AI systems that claim to be "learning" are actually open-loop: they generate outputs, but those outputs don't feed back into how they generate future outputs. The research pipeline here is genuinely closed: a failed experiment reduces the score of similar papers in the next grading cycle. A confirmed RunPod improvement adds the technique to the proven list, which penalizes re-implementation. The system gets incrementally smarter about what's worth trying.

The MLX to PyTorch Translation Gap

One thing that took a few runs to understand: architecture changes that work in MLX don't always transfer cleanly to PyTorch at scale. The two training scripts are structurally similar but not identical. The MLX version uses separate Q, K, V projections; the PyTorch version uses a fused QKV linear for efficiency on multi-GPU:

MLX:

self.q_proj = nn.Linear(d_model, d_model)
self.k_proj = nn.Linear(d_model, d_model)
self.v_proj = nn.Linear(d_model, d_model)

PyTorch:

self.qkv = nn.Linear(d_model, d_model * QKV_FACTOR)  # QKV_FACTOR presumably 3: Q, K, V fused into one projection

The PyTorch version also uses F.scaled_dot_product_attention with is_causal=True, which dispatches to FlashAttention on H100s. The MLX version implements attention manually. These differences mean that a change to the attention mechanism needs to be translated carefully, not just copied. The agent's instructions in program.md are explicit about this: "translate architecture changes from MLX to PyTorch after each promotion."

When a RunPod result is worse than the MLX signal predicted, the agent is supposed to investigate the translation. In practice, this is where the most interesting debugging happens: understanding why a technique that looked good locally doesn't scale.

What results.tsv Is and Why That's Enough

The audit trail is a tab-separated file with nine columns: commit, tier, val_bpb, artifact_bytes, memory_gb, status, promoted, cost_usd, description. Every experiment gets a row. Every RunPod run gets a row. The orchestrator appends to it automatically after each Tier 2 run.

commit  tier    val_bpb  artifact_bytes  memory_gb  status  promoted  cost_usd  description
local_bigramhash_1  local  1.1891  14200000  ...  keep  pending  0.00  add BigramHash(10240) embedding
runpod_a3f2b1c_0312  runpod  1.1743  14200000  ...  keep  yes  3.47  add BigramHash(10240) embedding

This is the simplest possible structure that answers the questions that matter: what did we try, what did it cost, did it work, and did we promote it? The get_tier_correlation() function reads paired local/runpod rows to compute the average delta between tiers. The get_experiment_history_bullets() function formats the last N rows as markdown for injection into program.md.

There's no database, no ORM, no query language. A TSV file that you can open in a spreadsheet or grep in a terminal. When you're running hundreds of experiments, the audit trail needs to be readable by a human who's debugging at 2am, not just by the system that wrote it.

What's next?

The most useful observation from building this is that most AI systems are open-loop pretending to be closed-loop. They have a feedback mechanism in theory- the agent "learns" from its outputs- but in practice the outputs don't actually change how the system selects future actions. The research pipeline here is a small example of what genuine feedback looks like: experiment outcomes directly modify the scoring function for future research items.

The second observation is about the value of cheap proxies. The 3% local threshold exists because running 500 MLX iterations takes a few minutes and costs nothing. Without that proxy, every hypothesis would need $3.50 to validate. With it, the agent can run 20 experiments locally for every one it promotes. The proxy isn't perfect- the MLX/PyTorch delta means some local improvements don't transfer- but it's good enough to filter out the obvious failures.

The third observation is about context management. program.md has a character budget for the research section (12,000 characters) because an agent with too much context performs worse than one with focused context. The top 12 items by score, with a hard character cap, is a deliberate choice. More items would dilute the signal. The verification step exists partly to compress high-value items into dense implementation briefs rather than verbose paper summaries.

What's still missing: an exploration frontier tracker that prevents the agent from clustering around a local optimum in technique space. The current system penalizes re-attempts of failed experiments, but it doesn't actively push toward unexplored directions. The grading calibration is also unchecked- there's no feedback loop between graded scores and actual experimental outcomes, so the grader could be systematically wrong about which dimensions matter most.

The competition is still running. The SOTA is still moving. The system runs, the research refreshes, the experiments accumulate in results.tsv, and the grading prompt gets a little more accurate with each cycle. That's the loop. It's not finished, but it's running, and the infrastructure is solid enough that I trust it to keep running without me watching it.