🤖 AI Summary
Social science reproducibility assessment faces four key bottlenecks: high manual annotation costs, narrow coverage in existing benchmarks (which test only code and data execution while neglecting consistency between results and the paper), oversimplified evaluation scenarios, and insufficient diversity in document formats and programming languages. To address these, we introduce REPRO-Bench, the first end-to-end reproducibility evaluation benchmark designed specifically for social science research. It comprises 112 real-world reproduction tasks, supports heterogeneous input formats (e.g., PDF, LaTeX) and multiple programming languages, and evaluates AI agents that combine PDF parsing, code execution, and automated verification of results against the paper's reported findings. Experiments reveal that state-of-the-art AI agents achieve only 21.4% accuracy on this benchmark. Our proposed REPRO-Agent, enhanced with optimized modules for citation grounding, executable snippet extraction, and semantic result alignment, reaches 36.7% accuracy, substantially advancing the frontier of automated scientific reproducibility assessment.
📝 Abstract
Assessing the reproducibility of social science papers is essential for promoting rigor in research processes, but manual assessment is costly. With recent advances in agentic AI systems (i.e., AI agents), we seek to evaluate their capability to automate this process. However, existing benchmarks for reproducing research papers (1) focus solely on reproducing results using provided code and data without assessing their consistency with the paper, (2) oversimplify real-world scenarios, and (3) lack necessary diversity in data formats and programming languages. To address these issues, we introduce REPRO-Bench, a collection of 112 task instances, each representing a social science paper with a publicly available reproduction report. Agents are tasked with assessing the reproducibility of the paper based on the original paper PDF and the corresponding reproduction package. REPRO-Bench features end-to-end evaluation tasks on the reproducibility of social science papers with complexity comparable to real-world assessments. We evaluate three representative AI agents on REPRO-Bench, with the best-performing agent achieving an accuracy of only 21.4%. Building on our empirical analysis, we develop REPRO-Agent, which improves on the highest accuracy achieved by existing agents by 71% (relative). We conclude that more advanced AI agents should be developed to automate real-world reproducibility assessment. REPRO-Bench is publicly available at https://github.com/uiuc-kang-lab/REPRO-Bench.
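To make the evaluation setup concrete, here is a minimal sketch of what a REPRO-Bench-style task instance and the accuracy metric could look like. The field names, label values, and `TaskInstance`/`accuracy` helpers are illustrative assumptions for this sketch, not the benchmark's actual schema or code.

```python
from dataclasses import dataclass

@dataclass
class TaskInstance:
    """One hypothetical task: a paper plus its reproduction artifacts.

    Field names are illustrative assumptions, not REPRO-Bench's schema.
    """
    paper_pdf: str       # path to the original paper PDF
    repro_package: str   # path to the provided code/data package
    gold_label: str      # verdict from the public reproduction report

def accuracy(predicted: list, gold: list) -> float:
    """Fraction of instances where the agent's verdict matches the report."""
    assert len(predicted) == len(gold) and gold
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

# An agent would read each instance's PDF and package, run the code,
# and emit a reproducibility verdict; accuracy compares those verdicts
# against the reproduction-report labels.
```

For example, under this sketch an agent whose verdicts match the reports on 24 of the 112 instances would score 24/112 ≈ 21.4%, the figure reported for the best existing agent.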