RExBench: Can coding agents autonomously implement AI research extensions?

📅 2025-06-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper investigates whether large language model (LLM)-based coding agents can autonomously implement extensions to existing AI research. Method: The authors introduce RExBench, a benchmark of 12 realistic research extension tasks, each built on a real paper and codebase and accompanied by domain expert-written instructions. Because the tasks target hypotheses that have not previously been implemented, the benchmark is robust to data contamination, and an automatic evaluation infrastructure executes agent outputs to determine whether success criteria are met. Contribution/Results: Nine LLM agents implemented with three frameworks (aider, Claude Code, and OpenHands) are evaluated; all fail to autonomously implement the majority of the extensions, and even with additional human-written hints the best success rate remains below 40%. These findings reveal a substantial gap between current LLM agents and autonomous research implementation, while establishing RExBench as a rigorous, reproducible benchmark for tracking progress on AI-augmented research.

📝 Abstract
Agents based on Large Language Models (LLMs) have shown promise for performing sophisticated software engineering tasks autonomously. In addition, there has been progress towards developing agents that can perform parts of the research pipeline in machine learning and the natural sciences. We argue that research extension and its implementation is a critical capability for such systems, and introduce RExBench to support the evaluation of this capability. RExBench is a benchmark consisting of 12 realistic research experiment implementation tasks that aim to investigate research hypotheses that have not previously been implemented. Each task is set up as an extension to an existing research paper and codebase, accompanied by domain expert-written instructions. RExBench is robust to data contamination, and supports an automatic evaluation infrastructure that executes agent outputs to determine whether the success criteria are met. We use this benchmark to evaluate nine LLM agents implemented using three different frameworks: aider, Claude Code, and OpenHands. We find that all agents evaluated fail to autonomously implement the majority of the extensions. Although the success rate improves with additional human-written hints, the best performance under this setting remains below 40%. This indicates that current agents are still short of being able to handle realistic research extension tasks without substantial human guidance.
Problem

Research questions and friction points this paper is trying to address.

Whether LLM-based coding agents can autonomously implement extensions to existing research experiments
How to evaluate research extension capability with realistic, contamination-robust tasks
How much human guidance current agents need to succeed on such tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

RExBench: 12 realistic research extension tasks built on existing papers and codebases
Tasks target previously unimplemented hypotheses, making the benchmark robust to data contamination
Automatic evaluation infrastructure that executes agent outputs against success criteria
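The execution-based evaluation described above can be sketched as a simple execute-and-check loop: run the agent-modified codebase, then verify that its output satisfies a success criterion. The function name, file paths, and metric-range criterion below are illustrative assumptions for the sketch, not RExBench's actual interface.

```python
import json
import subprocess
from pathlib import Path

def evaluate_agent_output(repo_dir: str, run_cmd: list[str],
                          metrics_path: str, metric_key: str,
                          lo: float, hi: float) -> bool:
    """Execute an agent-modified codebase and check one success criterion.

    Hypothetical criterion: the run must produce a JSON metrics file whose
    key metric falls within an expected range [lo, hi].
    """
    # Run the experiment inside the modified repository.
    result = subprocess.run(run_cmd, cwd=repo_dir,
                            capture_output=True, timeout=3600)
    if result.returncode != 0:
        return False  # a crashing run counts as task failure

    # The run is expected to write its metrics to a known file.
    metrics_file = Path(repo_dir) / metrics_path
    if not metrics_file.exists():
        return False

    metrics = json.loads(metrics_file.read_text())
    value = metrics.get(metric_key)
    return value is not None and lo <= value <= hi
```

Checking a numeric range rather than an exact value reflects that rerunning a research experiment rarely reproduces results bit-for-bit; any tolerance would need to be set per task by the domain expert who authored it.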