ReplicationBench: Can AI Agents Replicate Astrophysics Research Papers?

📅 2025-10-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work systematically evaluates the ability of AI agents to replicate the core scientific contributions of astrophysics research: experimental design, theoretical derivation, data analysis, and code implementation. Method: the authors introduce the first domain-expert-curated, full-paper-scale astrophysics replication benchmark, rigorously assessing agent faithfulness and technical correctness via task decomposition, human-in-the-loop validation, and end-to-end replication evaluation. Contribution/Results: the benchmark uncovers diverse failure modes of current LLM-driven AI agents in scientific replication, with the best-performing model scoring under 20%, revealing severe limitations in scientific rigor. The work establishes a scalable, verifiable paradigm for evaluating AI reliability in data-driven scientific discovery.

📝 Abstract
Frontier AI agents show increasing promise as scientific research assistants, and may eventually be useful for extended, open-ended research workflows. However, in order to use agents for novel research, we must first assess the underlying faithfulness and correctness of their work. To evaluate agents as research assistants, we introduce ReplicationBench, an evaluation framework that tests whether agents can replicate entire research papers drawn from the astrophysics literature. Astrophysics, where research relies heavily on archival data and computational study while requiring little real-world experimentation, is a particularly useful testbed for AI agents in scientific research. We split each paper into tasks that require agents to replicate the paper's core contributions, including the experimental setup, derivations, data analysis, and codebase. Each task is co-developed with the original paper authors and targets a key scientific result, enabling objective evaluation of both faithfulness (adherence to original methods) and correctness (technical accuracy of results). ReplicationBench is extremely challenging for current frontier language models: even the best-performing language models score under 20%. We analyze ReplicationBench trajectories in collaboration with domain experts and find a rich, diverse set of failure modes for agents in scientific research. ReplicationBench establishes the first benchmark of paper-scale, expert-validated astrophysics research tasks, reveals insights about agent performance generalizable to other domains of data-driven science, and provides a scalable framework for measuring AI agents' reliability in scientific research.
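
To make the evaluation paradigm concrete, here is a minimal illustrative sketch, not the authors' released code: the task schema, field names, and numeric-tolerance grading rule below are assumptions, chosen only to show how a paper might be decomposed into tasks that are graded on the two axes the abstract describes, faithfulness and correctness.

    from dataclasses import dataclass

    # Illustrative sketch only: ReplicationBench's actual task format and grading
    # pipeline are not shown in this listing; every name here is hypothetical.

    @dataclass
    class ReplicationTask:
        """One paper-derived task targeting a single key scientific result."""
        task_id: str
        description: str        # what the agent must replicate (derivation, figure, analysis)
        reference_value: float  # expert-validated ground-truth result
        rel_tolerance: float = 0.05

        def grade(self, agent_value: float, followed_method: bool) -> dict:
            """Score the two axes: faithfulness (method adherence) and correctness."""
            correct = abs(agent_value - self.reference_value) <= (
                self.rel_tolerance * abs(self.reference_value)
            )
            return {"faithfulness": followed_method, "correctness": correct}

    def paper_score(task_results: list[dict]) -> float:
        """Fraction of tasks where the agent was both faithful and correct."""
        if not task_results:
            return 0.0
        passed = sum(r["faithfulness"] and r["correctness"] for r in task_results)
        return passed / len(task_results)

In the actual benchmark, tasks cover richer artifacts (derivations, analyses, codebases) validated by the original authors rather than single numeric tolerances; the sketch only shows the shape of the faithfulness/correctness split and the per-paper aggregation.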
Problem

Research questions and friction points this paper is trying to address.

Evaluating AI agents' ability to replicate entire astrophysics research papers
Assessing the faithfulness and correctness of AI agents' work in scientific workflows
Establishing a benchmark of paper-scale, expert-validated research tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

A framework that evaluates AI agents on replicating full astrophysics papers
Tasks co-developed with the original paper authors for objective evaluation
A benchmark that measures both the faithfulness and the correctness of agent research
Christine Ye
Stanford University
Sihan Yuan
Stanford University
Suchetha Cooray
Stanford University
Steven Dillmann
Stanford University, University of Cambridge
AI for Science · Machine Learning · Data Driven Discovery · Computational Mathematics
Ian L. V. Roque
Stanford University
Dalya Baron
Stanford University
Philipp Frank
Stanford University
Sergio Martin-Alvarez
Stanford University
Nolan Koblischke
University of Toronto
Frank J. Qu
Stanford University
Diyi Yang
Stanford University
Computational Social Science · Natural Language Processing · Machine Learning
Risa Wechsler
Stanford University
Ioana Ciuca
Stanford University