🤖 AI Summary
This work addresses the capability bottlenecks of AI agents in scientific reproducibility tasks, particularly for rapidly reproducing LLM training optimizations (e.g., based on NanoGPT).
Method: The authors introduce the first automated benchmark grounded in a "speedrun" paradigm, comprising 19 executable, hardware-aware algorithm reproduction tasks. The benchmark provides multi-level hints—ranging from pseudocode to paper-style descriptions—and evaluates state-of-the-art reasoning-oriented LLMs and agent scaffolds on end-to-end code generation and optimization.
Contribution/Results: The benchmark establishes a simple, non-saturated quantitative measure of AI-driven scientific reproduction. Empirical results show that even top-tier reasoning LLMs, given detailed hints, fail to reliably reproduce well-documented training optimizations—exposing limitations in algorithmic comprehension, system-level co-design, and scientific reasoning, and highlighting a gap between current LLM capabilities and the demands of autonomous AI research.
📝 Abstract
Rapid advancements in large language models (LLMs) have the potential to assist in scientific progress. A critical capability toward this endeavor is the ability to reproduce existing work. To evaluate the ability of AI agents to reproduce results in an active research area, we introduce the Automated LLM Speedrunning Benchmark, leveraging the research community's contributions to the NanoGPT speedrun, a competition to train a GPT-2 model in the shortest time. Each of the 19 speedrun tasks provides the agent with the previous record's training script, optionally paired with one of three hint formats, ranging from pseudocode to paper-like descriptions of the new record's improvements. Records execute quickly by design, and speedrun improvements encompass diverse code-level changes, ranging from high-level algorithmic advancements to hardware-aware optimizations. These features make the benchmark both accessible and realistic for the frontier problem of improving LLM training. We find that recent reasoning LLMs combined with SoTA scaffolds struggle to reimplement already-known innovations in our benchmark, even when given detailed hints. Our benchmark thus provides a simple, non-saturated measure of an LLM's ability to automate scientific reproduction, a necessary (but not sufficient) skill for an autonomous research agent.
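The task structure described above—a previous record's script plus an optional hint at one of several detail levels—can be sketched as follows. This is a minimal illustration, not the benchmark's actual API; the names `SpeedrunTask`, `HintLevel`, and `build_prompt` are hypothetical:

```python
from dataclasses import dataclass
from enum import Enum


class HintLevel(Enum):
    """Hypothetical hint tiers mirroring the three formats the abstract mentions."""
    NONE = 0        # agent sees only the previous record's training script
    PSEUDOCODE = 1  # pseudocode of the new record's change
    PAPER = 2       # paper-like description of the improvement


@dataclass
class SpeedrunTask:
    record_index: int             # which of the 19 speedrun records to reproduce
    prev_script: str              # previous record's training script (source text)
    hints: dict                   # HintLevel -> hint text for the new record


def build_prompt(task: SpeedrunTask, level: HintLevel) -> str:
    """Assemble the agent's input: previous script, optionally followed by a hint."""
    parts = [f"# Record {task.record_index}: previous training script\n{task.prev_script}"]
    if level is not HintLevel.NONE:
        parts.append(f"# Hint ({level.name}):\n{task.hints[level]}")
    return "\n\n".join(parts)
```

An agent scaffold would consume this prompt, emit a modified training script, and be scored by executing that script and comparing its wall-clock time against the actual record's.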