🤖 AI Summary
Current AI agents struggle to execute end-to-end AI research experiments, including hypothesis generation, experimental design, code implementation, and result analysis. Method: We introduce EXP-Bench, the first benchmark targeting the full scientific workflow, comprising 461 realistic tasks derived from 51 top-tier conference papers. We propose a semi-autonomous pipeline for extracting experimental details, formally define and quantify agents' capability to execute runnable AI experiments end-to-end, and establish a structured process model with a multi-granularity scoring system. Results: Experiments reveal severe limitations: state-of-the-art LLM-based agents (e.g., OpenHands) achieve only a 0.5% success rate on complete experiments, with subtask accuracy peaking at 20-35%, exposing fundamental bottlenecks in coherence, robustness, and cross-step reasoning. EXP-Bench is open-sourced, providing foundational evaluation infrastructure for AI-for-AI research.
📝 Abstract
Automating AI research holds immense potential for accelerating scientific progress, yet current AI agents struggle with the complexities of rigorous, end-to-end experimentation. We introduce EXP-Bench, a novel benchmark designed to systematically evaluate AI agents on complete research experiments sourced from influential AI publications. Given a research question and incomplete starter code, EXP-Bench challenges AI agents to formulate hypotheses, design and implement experimental procedures, execute them, and analyze results. To enable the creation of such intricate and authentic tasks with high fidelity, we design a semi-autonomous pipeline to extract and structure crucial experimental details from these research papers and their associated open-source code. Using this pipeline, we curated 461 AI research tasks from 51 top-tier AI research papers. Evaluations of leading LLM-based agents, such as OpenHands and IterativeAgent, on EXP-Bench demonstrate partial capabilities: while scores on individual experimental aspects such as design or implementation correctness occasionally reach 20-35%, the success rate for complete, executable experiments is a mere 0.5%. By identifying these bottlenecks and providing realistic step-by-step experiment procedures, EXP-Bench serves as a vital tool for future AI agents to improve their ability to conduct AI research experiments. EXP-Bench is open-sourced at https://github.com/Just-Curieous/Curie/tree/main/benchmark/exp_bench.
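The gap between 20-35% per-aspect scores and a 0.5% end-to-end success rate follows from the scoring structure described above: aspects earn partial credit independently, but a complete experiment succeeds only if every aspect, including actual execution, fully passes. The sketch below illustrates that arithmetic; the aspect names, the [0, 1] partial-credit scale, and the aggregation are our assumptions for illustration, not the benchmark's actual scorer.

```python
# Hypothetical illustration of multi-granularity scoring (not EXP-Bench's
# real scorer): per-aspect averages admit partial credit, while the
# complete-experiment success rate is all-or-nothing.
from dataclasses import dataclass


@dataclass
class TaskResult:
    design: float          # assumed partial credit for experimental design, in [0, 1]
    implementation: float  # assumed partial credit for implementation, in [0, 1]
    execution: bool        # did the experiment actually run to completion?
    analysis: float        # assumed partial credit for result analysis, in [0, 1]


def aspect_scores(results: list[TaskResult]) -> dict[str, float]:
    """Average partial credit on each graded aspect across tasks."""
    n = len(results)
    return {
        "design": sum(r.design for r in results) / n,
        "implementation": sum(r.implementation for r in results) / n,
        "analysis": sum(r.analysis for r in results) / n,
    }


def complete_success_rate(results: list[TaskResult]) -> float:
    """All-or-nothing: every aspect must be fully correct and runnable."""
    ok = sum(
        r.design == 1.0
        and r.implementation == 1.0
        and r.execution
        and r.analysis == 1.0
        for r in results
    )
    return ok / len(results)


if __name__ == "__main__":
    # Toy data: decent partial credit, but only one fully successful run.
    results = [
        TaskResult(design=1.0, implementation=0.5, execution=False, analysis=0.0),
        TaskResult(design=1.0, implementation=1.0, execution=True, analysis=1.0),
        TaskResult(design=0.0, implementation=0.0, execution=False, analysis=0.0),
    ]
    print(aspect_scores(results))
    print(complete_success_rate(results))
```

On this toy data the aspect averages stay moderate while the complete-success rate collapses to one task in three, mirroring (in exaggerated form) how agents can score 20-35% on individual aspects yet almost never deliver a fully executable experiment.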