From Reproduction to Replication: Evaluating Research Agents with Progressive Code Masking

📅 2025-06-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Existing AI agents lack a systematic benchmark for evaluating scientific experimentation capabilities across diverse code initialization scenarios—from scratch implementation to partial reproduction—hindering accurate assessment of their potential for automating research.

Method: We introduce AutoExperiment, the first benchmark to quantitatively measure agents' ability to implement and execute machine learning experiments solely from paper descriptions, using a progressive code masking mechanism that spans the full spectrum from code completion to full experimental reproduction. Our methodology integrates natural language understanding, multi-turn code generation, sandboxed execution, dynamic debugging, and result validation, evaluated via Pass@1 and Pass@5 metrics across multiple stages.

Results: Experiments reveal a significant performance drop for current agents under zero-initialization conditions; agents equipped with environment interaction and iterative trial-and-error capabilities substantially outperform others, underscoring long-horizon closed-loop reasoning as a fundamental bottleneck in AI-driven scientific discovery.

📝 Abstract
Recent progress in autonomous code generation has fueled excitement around AI agents capable of accelerating scientific discovery by running experiments. However, there is currently no benchmark that evaluates whether such agents can implement scientific ideas when given varied amounts of code as a starting point, interpolating between reproduction (running code) and from-scratch replication (fully re-implementing and running code). We introduce AutoExperiment, a benchmark that evaluates AI agents' ability to implement and run machine learning experiments based on natural language descriptions in research papers. In each task, agents are given a research paper, a codebase with key functions masked out, and a command to run the experiment. The goal is to generate the missing code, execute the experiment in a sandboxed environment, and reproduce the results. AutoExperiment scales in difficulty by varying the number of missing functions $n$, ranging from partial reproduction to full replication. We evaluate state-of-the-art agents and find that performance degrades rapidly as $n$ increases. Agents that can dynamically interact with the environment (e.g. to debug their code) can outperform agents in fixed "agentless" harnesses, and there exists a significant gap between single-shot and multi-trial success rates (Pass@1 vs. Pass@5), motivating verifier approaches to our benchmark. Our findings highlight critical challenges in long-horizon code generation, context retrieval, and autonomous experiment execution, establishing AutoExperiment as a new benchmark for evaluating progress in AI-driven scientific experimentation. Our data and code are open-sourced at https://github.com/j1mk1m/AutoExperiment .
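The Pass@1 vs. Pass@5 gap discussed above can be quantified with the standard unbiased Pass@k estimator (the paper does not specify its exact estimator, so this is an assumption; the function name and arguments are illustrative):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: the probability that at least one
    of k attempts drawn without replacement from n total attempts
    (c of which succeeded) is a success."""
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 5 trials, 2 successes.
# Pass@1 = 2/5, while Pass@5 = 1.0 -- a large single-shot vs.
# multi-trial gap of the kind that motivates verifier approaches.
```

A verifier that reliably picks the best of five candidate runs would lift the effective score from Pass@1 toward Pass@5.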
Problem

Research questions and friction points this paper is trying to address.

Evaluating AI agents' ability to implement scientific ideas from varied code inputs
Assessing agents' performance in generating missing code for ML experiments
Measuring challenges in long-horizon code generation and autonomous execution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive code masking for varied difficulty
Sandboxed environment for autonomous execution
Dynamic interaction improves debugging success
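Progressive code masking, as described in the abstract, removes a chosen number of key functions from the codebase so the agent must re-implement them from the paper. A minimal sketch of such a masking step (the benchmark's actual procedure is not shown in this page, so names and details here are assumptions):

```python
import ast

def mask_functions(source: str, names: list[str]) -> str:
    """Replace the bodies of the named functions with a
    NotImplementedError stub, keeping each signature intact so an
    agent must regenerate the implementation from the paper text."""
    tree = ast.parse(source)
    stub = ast.parse("raise NotImplementedError('masked')").body
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name in names:
            node.body = list(stub)
    return ast.unparse(tree)
```

Varying how many names are masked interpolates between partial reproduction (few functions missing) and full replication (all key functions missing), which is how the benchmark scales difficulty with $n$.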