The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements

📅 2025-06-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a capability bottleneck of AI agents in scientific reproduction: rapidly reproducing LLM training optimizations, using the NanoGPT speedrun as a testbed. Method: The authors introduce an automated benchmark built around this speed-race setting, comprising 19 executable, hardware-aware algorithm reproduction tasks. Each task supplies the previous record's training script, optionally paired with hints at multiple levels of detail (from pseudocode to paper-style descriptions), and state-of-the-art reasoning LLMs with agent scaffolds are evaluated on end-to-end code generation and optimization. Contribution/Results: The benchmark offers a non-saturated, quantitative measure of AI-driven scientific reproduction. Empirically, even top-tier reasoning LLMs under strong prompting fail to reliably reproduce well-documented training optimizations, exposing limitations in algorithmic comprehension, system-level co-design, and scientific reasoning, and highlighting a substantial gap between current LLM capabilities and the demands of autonomous AI research.

📝 Abstract
Rapid advancements in large language models (LLMs) have the potential to assist in scientific progress. A critical capability toward this endeavor is the ability to reproduce existing work. To evaluate the ability of AI agents to reproduce results in an active research area, we introduce the Automated LLM Speedrunning Benchmark, leveraging the research community's contributions to the NanoGPT speedrun, a competition to train a GPT-2 model in the shortest time. Each of the 19 speedrun tasks provides the agent with the previous record's training script, optionally paired with one of three hint formats, ranging from pseudocode to paper-like descriptions of the new record's improvements. Records execute quickly by design, and speedrun improvements encompass diverse code-level changes, ranging from high-level algorithmic advancements to hardware-aware optimizations. These features make the benchmark both accessible and realistic for the frontier problem of improving LLM training. We find that recent reasoning LLMs combined with SoTA scaffolds struggle to reimplement already-known innovations in our benchmark, even when given detailed hints. Our benchmark thus provides a simple, non-saturated measure of an LLM's ability to automate scientific reproduction, a necessary (but not sufficient) skill for an autonomous research agent.
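
To make the task setup concrete, below is a minimal sketch of how a single speedrun task could be scored. All names here (`SpeedrunTask`, `time_to_target`, the fraction-of-speedup metric) are illustrative assumptions inferred from the abstract, not the benchmark's actual API.

```python
import subprocess
import time
from dataclasses import dataclass

@dataclass
class SpeedrunTask:
    """One of the 19 reproduction tasks (illustrative structure, not the real schema)."""
    record_script: str            # previous record's training script, given to the agent
    hint: str | None              # optional hint: pseudocode, text, or paper-style description
    prev_record_runtime_s: float  # wall-clock time of the previous record
    next_record_runtime_s: float  # wall-clock time of the record to be reproduced

def time_to_target(script_path: str) -> float:
    """Run a training script and return its wall-clock runtime in seconds.

    Assumes the script exits once it reaches its target validation loss,
    as speedrun records do by design.
    """
    start = time.time()
    subprocess.run(["python", script_path], check=True)
    return time.time() - start

def fraction_of_speedup_recovered(task: SpeedrunTask, agent_script: str) -> float:
    """Share of the known record-to-record speedup achieved by the agent's script.

    1.0 means the agent matched the next record; 0.0 means no improvement
    over the previous record. (An assumed scoring rule, for illustration only.)
    """
    full_gain = task.prev_record_runtime_s - task.next_record_runtime_s
    if full_gain <= 0:
        return 0.0  # degenerate task; guard against division by zero
    agent_gain = task.prev_record_runtime_s - time_to_target(agent_script)
    return max(0.0, min(1.0, agent_gain / full_gain))
```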
Problem

Research questions and friction points this paper is trying to address.

Evaluating AI agents' ability to reproduce NanoGPT speedrun improvements
Assessing LLMs' capacity to automate scientific reproduction tasks
Measuring performance in reimplementing known training optimizations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated benchmark for LLM speedrunning tasks
Leverages NanoGPT speedrun competition data
Evaluates AI agents on code-level optimizations (see the scaffold sketch below)
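
As referenced above, a bare-bones version of such an agent scaffold might look like the following generate-run-refine loop. It reuses `time_to_target` (and the imports) from the earlier sketch; `llm_propose_patch` stands in for any LLM call and is purely hypothetical, not one of the SoTA scaffolds actually benchmarked.

```python
import subprocess
from typing import Callable, Optional

def speedrun_attempt(
    record_script: str,
    hint: Optional[str],
    llm_propose_patch: Callable[[str, Optional[str], str], str],
    max_iters: int = 5,
) -> str:
    """Iteratively ask an LLM to speed up a record script (illustrative scaffold).

    llm_propose_patch(script_text, hint, feedback) should return a full
    rewritten training script. Returns the fastest script found.
    """
    with open(record_script) as f:
        best_script = f.read()
    best_time = time_to_target(record_script)  # baseline: the previous record
    script, feedback = best_script, ""
    for _ in range(max_iters):
        candidate = llm_propose_patch(script, hint, feedback)
        with open("candidate.py", "w") as f:
            f.write(candidate)
        try:
            t = time_to_target("candidate.py")
            feedback = f"candidate ran in {t:.1f}s (best so far: {best_time:.1f}s)"
            if t < best_time:
                best_script, best_time = candidate, t
        except subprocess.CalledProcessError as exc:
            feedback = f"candidate crashed: {exc}"  # let the LLM debug its own code
        script = candidate
    return best_script
```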
👥 Authors

Bingchen Zhao
University of Edinburgh
Artificial Intelligence, Knowledge Discovery

Despoina Magka
University of Oxford, Department of Computer Science
Artificial intelligence, Knowledge representation and reasoning, Logic

Minqi Jiang
Meta

Xian Li
Meta

Roberta Raileanu
Research Scientist at Google DeepMind, Honorary Lecturer at UCL
Artificial Intelligence, Reinforcement Learning, Deep Learning, Open-Ended Learning

Tatiana Shavrina
Meta
Natural language processing, computational linguistics, benchmarking, multilinguality

Jean-Christophe Gagnon-Audet
Meta

Kelvin Niu
Meta

Shagun Sodhani
Google DeepMind
Machine Learning, Reinforcement Learning, Lifelong Learning

Michael Shvartsman
Research Scientist, Meta Reality Labs Research
Computational cognitive science and machine learning for neuroscience

Andrei Lupu
University of Oxford & FAIR, Meta AI
Reinforcement Learning, Multi-Agent RL

Alisia Lupidi
University of Cambridge

Edan Toledo
Meta & UCL
Reinforcement Learning, Natural Language Processing, Multi-Agent Reinforcement Learning

Karen Hambardzumyan
FAIR, Meta + University College London
Interpretability, Natural Language Processing, Few-Shot Learning

Martin Josifoski
Meta

Thomas Foster
University of Oxford

Lucia Cipolina-Kun
Meta

Abhishek Charnalia
Meta

Derek Dunfield
Meta

Alexander H. Miller
Meta

Oisin Mac Aodha
Reader (Associate Professor), University of Edinburgh
Computer Vision, Machine Learning, Machine Teaching, Active Learning, Conservation Technology

Jakob Foerster
Associate Professor, University of Oxford
Artificial Intelligence

Yoram Bachrach
Meta (FAIR)
Artificial Intelligence, Machine Learning, Multiagent Systems