🤖 AI Summary
Current AI agents struggle to execute end-to-end AI research experiments, including hypothesis generation, experimental design, code implementation, and result analysis. Method: We introduce EXP-Bench, the first benchmark targeting the full scientific workflow, comprising 461 realistic tasks derived from 51 top-tier conference papers. We propose a semi-autonomous pipeline for extracting experimental details, formally define and quantify agents' capability to execute runnable AI experiments end-to-end, and establish a structured process model with a multi-granularity scoring system. Results: Experiments reveal severe limitations: state-of-the-art LLM-based agents (e.g., OpenHands) achieve only a 0.5% success rate on complete experiments, with subtask accuracy peaking at 20-35%, exposing fundamental bottlenecks in coherence, robustness, and cross-step reasoning. EXP-Bench is open-sourced, providing foundational evaluation infrastructure for AI-for-AI research.
📝 Abstract
Automating AI research holds immense potential for accelerating scientific progress, yet current AI agents struggle with the complexities of rigorous, end-to-end experimentation. We introduce EXP-Bench, a novel benchmark designed to systematically evaluate AI agents on complete research experiments sourced from influential AI publications. Given a research question and incomplete starter code, EXP-Bench challenges AI agents to formulate hypotheses, design and implement experimental procedures, execute them, and analyze results. To enable the creation of such intricate and authentic tasks with high fidelity, we design a semi-autonomous pipeline to extract and structure crucial experimental details from these research papers and their associated open-source code. Using this pipeline, we curated 461 AI research tasks from 51 top-tier AI research papers. Evaluations of leading LLM-based agents, such as OpenHands and IterativeAgent, on EXP-Bench demonstrate partial capabilities: while scores on individual experimental aspects such as design or implementation correctness occasionally reach 20-35%, the success rate for complete, executable experiments is a mere 0.5%. By identifying these bottlenecks and providing realistic step-by-step experiment procedures, EXP-Bench serves as a vital tool for future AI agents to improve their ability to conduct AI research experiments. EXP-Bench is open-sourced at https://github.com/Just-Curieous/Curie/tree/main/benchmark/exp_bench.
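The gap between 20-35% per-aspect scores and a 0.5% end-to-end success rate follows from the scoring structure described above: aspects earn partial credit independently, but a complete experiment succeeds only if every aspect, including actual execution, fully passes. The sketch below illustrates that arithmetic; the aspect names, the [0, 1] partial-credit scale, and the aggregation are our assumptions for illustration, not the benchmark's actual scorer.

```python
# Hypothetical illustration of multi-granularity scoring (not EXP-Bench's
# real scorer): per-aspect averages admit partial credit, while the
# complete-experiment success rate is all-or-nothing.
from dataclasses import dataclass


@dataclass
class TaskResult:
    design: float          # assumed partial credit for experimental design, in [0, 1]
    implementation: float  # assumed partial credit for implementation, in [0, 1]
    execution: bool        # did the experiment actually run to completion?
    analysis: float        # assumed partial credit for result analysis, in [0, 1]


def aspect_scores(results: list[TaskResult]) -> dict[str, float]:
    """Average partial credit on each graded aspect across tasks."""
    n = len(results)
    return {
        "design": sum(r.design for r in results) / n,
        "implementation": sum(r.implementation for r in results) / n,
        "analysis": sum(r.analysis for r in results) / n,
    }


def complete_success_rate(results: list[TaskResult]) -> float:
    """All-or-nothing: every aspect must be fully correct and runnable."""
    ok = sum(
        r.design == 1.0
        and r.implementation == 1.0
        and r.execution
        and r.analysis == 1.0
        for r in results
    )
    return ok / len(results)


if __name__ == "__main__":
    # Toy data: decent partial credit, but only one fully successful run.
    results = [
        TaskResult(design=1.0, implementation=0.5, execution=False, analysis=0.0),
        TaskResult(design=1.0, implementation=1.0, execution=True, analysis=1.0),
        TaskResult(design=0.0, implementation=0.0, execution=False, analysis=0.0),
    ]
    print(aspect_scores(results))
    print(complete_success_rate(results))
```

On this toy data the aspect averages stay moderate while the complete-success rate collapses to one task in three, mirroring (in exaggerated form) how agents can score 20-35% on individual aspects yet almost never deliver a fully executable experiment.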