🤖 AI Summary
This work addresses the capability bottlenecks of AI agents in scientific reproducibility tasks, particularly for rapidly reproducing LLM training optimizations (e.g., based on NanoGPT).
Method: The authors introduce the first automated benchmark grounded in a "speedrun" paradigm, comprising 19 executable, hardware-aware algorithm reproduction tasks. The benchmark provides multi-level hints—ranging from pseudocode to paper-style descriptions—and evaluates state-of-the-art reasoning-oriented LLMs and agent scaffolds on end-to-end code generation and optimization.
Contribution/Results: The benchmark establishes a simple, non-saturated quantitative measure of AI-driven scientific reproduction. Empirical results show that even top-tier reasoning LLMs, given detailed hints, fail to reliably reproduce well-documented training optimizations—exposing limitations in algorithmic comprehension, system-level co-design, and scientific reasoning, and highlighting a gap between current LLM capabilities and the demands of autonomous AI research.
📝 Abstract
Rapid advancements in large language models (LLMs) have the potential to assist in scientific progress. A critical capability toward this endeavor is the ability to reproduce existing work. To evaluate the ability of AI agents to reproduce results in an active research area, we introduce the Automated LLM Speedrunning Benchmark, leveraging the research community's contributions to the NanoGPT speedrun, a competition to train a GPT-2 model in the shortest time. Each of the 19 speedrun tasks provides the agent with the previous record's training script, optionally paired with one of three hint formats, ranging from pseudocode to paper-like descriptions of the new record's improvements. Records execute quickly by design, and speedrun improvements encompass diverse code-level changes, ranging from high-level algorithmic advancements to hardware-aware optimizations. These features make the benchmark both accessible and realistic for the frontier problem of improving LLM training. We find that recent reasoning LLMs combined with SoTA scaffolds struggle to reimplement already-known innovations in our benchmark, even when given detailed hints. Our benchmark thus provides a simple, non-saturated measure of an LLM's ability to automate scientific reproduction, a necessary (but not sufficient) skill for an autonomous research agent.
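The task structure described above—a previous record's script plus an optional hint at one of several detail levels—can be sketched as follows. This is a minimal illustration, not the benchmark's actual API; the names `SpeedrunTask`, `HintLevel`, and `build_prompt` are hypothetical:

```python
from dataclasses import dataclass
from enum import Enum


class HintLevel(Enum):
    """Hypothetical hint tiers mirroring the three formats the abstract mentions."""
    NONE = 0        # agent sees only the previous record's training script
    PSEUDOCODE = 1  # pseudocode of the new record's change
    PAPER = 2       # paper-like description of the improvement


@dataclass
class SpeedrunTask:
    record_index: int             # which of the 19 speedrun records to reproduce
    prev_script: str              # previous record's training script (source text)
    hints: dict                   # HintLevel -> hint text for the new record


def build_prompt(task: SpeedrunTask, level: HintLevel) -> str:
    """Assemble the agent's input: previous script, optionally followed by a hint."""
    parts = [f"# Record {task.record_index}: previous training script\n{task.prev_script}"]
    if level is not HintLevel.NONE:
        parts.append(f"# Hint ({level.name}):\n{task.hints[level]}")
    return "\n\n".join(parts)
```

An agent scaffold would consume this prompt, emit a modified training script, and be scored by executing that script and comparing its wall-clock time against the actual record's.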