REPRO-Bench: Can Agentic AI Systems Assess the Reproducibility of Social Science Research?

📅 2025-07-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Manual reproducibility assessment of social science research is costly, and existing benchmarks for reproducing research papers suffer from three key limitations: narrow coverage (testing only code/data execution while neglecting consistency between results and the paper), oversimplified evaluation scenarios, and insufficient diversity in document formats and programming languages. To address these, the authors introduce REPRO-Bench, the first end-to-end reproducibility evaluation benchmark specifically designed for social science research. It comprises 112 real-world reproduction tasks, supports heterogeneous input formats (e.g., PDF, LaTeX) and multiple programming languages, and requires agents to combine PDF parsing, code execution, and verification of computed results against the paper's reported findings. Experiments show that state-of-the-art AI agents achieve only 21.4% accuracy on this benchmark. The authors' proposed REPRO-Agent reaches 36.7% accuracy, substantially advancing the frontier of automated scientific reproducibility assessment.

📝 Abstract
Assessing the reproducibility of social science papers is essential for promoting rigor in research processes, but manual assessment is costly. With recent advances in agentic AI systems (i.e., AI agents), we seek to evaluate their capability to automate this process. However, existing benchmarks for reproducing research papers (1) focus solely on reproducing results using provided code and data without assessing their consistency with the paper, (2) oversimplify real-world scenarios, and (3) lack necessary diversity in data formats and programming languages. To address these issues, we introduce REPRO-Bench, a collection of 112 task instances, each representing a social science paper with a publicly available reproduction report. The agents are tasked with assessing the reproducibility of the paper based on the original paper PDF and the corresponding reproduction package. REPRO-Bench features end-to-end evaluation tasks on the reproducibility of social science papers with complexity comparable to real-world assessments. We evaluate three representative AI agents on REPRO-Bench, with the best-performing agent achieving an accuracy of only 21.4%. Building on our empirical analysis, we develop REPRO-Agent, which improves the highest accuracy achieved by existing agents by 71%. We conclude that more advanced AI agents should be developed to automate real-world reproducibility assessment. REPRO-Bench is publicly available at https://github.com/uiuc-kang-lab/REPRO-Bench.
Problem

Research questions and friction points this paper is trying to address.

Evaluating AI agents' ability to assess social science paper reproducibility
Addressing limitations in existing benchmarks for research reproducibility
Developing REPRO-Bench for end-to-end reproducibility assessment tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

REPRO-Bench evaluates AI agents on reproducibility assessment
End-to-end tasks with real-world complexity and diversity
REPRO-Agent improves accuracy by 71% over existing agents
Chuxuan Hu
University of Illinois Urbana-Champaign
Liyun Zhang
Shanghai Jiao Tong University
Yeji Lim
University of Illinois Urbana-Champaign
Aum Wadhwani
University of Illinois Urbana-Champaign
Austin Peters
University of Chicago
Daniel Kang
University of Illinois Urbana-Champaign, Computer Science