ARA: Agentic Reproducibility Assessment For Scalable Support Of Scientific Peer-Review

📅 2026-05-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Modern scientific research has grown in scale and complexity to the point where traditional peer review struggles to evaluate reproducibility. This work frames reproducibility assessment as a structured reasoning task over scientific documents. Its agent-based approach uses large language models to extract structured information from a paper, construct a directed workflow graph, and score that graph along structural and content-based dimensions. The method reconstructs computational workflows consistently across scientific domains, LLMs, and model temperatures. It reaches approximately 61% accuracy on ReScience C, 60.71% on ReproBench (versus 36.84% for the compared prior result), and 61.68% on GoldStandardDB (versus 43.56%).

📝 Abstract

Scientific peer review increasingly struggles to assess reproducibility at the scale and complexity of modern research output. Evaluating reproducibility requires reconstructing experimental dependencies, methodological choices, data flows, and result-generating procedures, which often exceeds what human reviewers can provide. Agentic Reproducibility Assessment (ARA) formalizes reproducibility assessment as a structured reasoning task over scientific documents. Given a paper, ARA extracts a directed workflow graph linking sources, methods, experiments, and outputs, then evaluates how reconstructable that workflow is using structural and content-based scores. Experiments on 213 ReScience C articles, the largest cross-domain benchmark of human-validated computational reproducibility studies considered to date, demonstrate that ARA generalizes and that its workflow reconstruction and assessment remain consistent across LLMs, model temperatures, and scientific domains. ARA achieves approximately 61% accuracy on three benchmarks, including the highest accuracy reported on ReproBench (60.71% vs. 36.84%) and GoldStandardDB (61.68% vs. 43.56%), highlighting its potential to complement human review at scale and to support next-generation peer review. Code and data are available at https://github.com/AndresLaverdeMarin/agentic_reproducibility_assessment.
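
To make the idea of a directed workflow graph with structural and content-based scores more concrete, here is a minimal Python sketch. It is not ARA's implementation: the class `WorkflowGraph`, the functions `structural_score` and `content_score`, the `described` flag, and both scoring formulas are hypothetical choices made for illustration only. The sketch assumes that a structural score could measure how many outputs are reachable from some source, and that a content score could measure how many workflow steps the paper describes in enough detail.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field

# Node kinds taken from the abstract: sources, methods, experiments, outputs.
NODE_KINDS = {"source", "method", "experiment", "output"}


@dataclass
class WorkflowGraph:
    """Hypothetical container for a workflow extracted from one paper."""

    kinds: dict[str, str] = field(default_factory=dict)       # node -> kind
    described: dict[str, bool] = field(default_factory=dict)  # node -> enough detail?
    edges: dict[str, set[str]] = field(default_factory=dict)  # node -> successors

    def add_node(self, name: str, kind: str, described: bool = True) -> None:
        if kind not in NODE_KINDS:
            raise ValueError(f"unknown node kind: {kind}")
        self.kinds[name] = kind
        self.described[name] = described
        self.edges.setdefault(name, set())

    def add_edge(self, src: str, dst: str) -> None:
        if src not in self.kinds or dst not in self.kinds:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges[src].add(dst)

    def reachable_from_sources(self) -> set[str]:
        """Breadth-first search starting from every source node."""
        queue = deque(n for n, k in self.kinds.items() if k == "source")
        seen = set(queue)
        while queue:
            node = queue.popleft()
            for nxt in self.edges[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen


def structural_score(graph: WorkflowGraph) -> float:
    """Assumed metric: fraction of outputs reachable from at least one source."""
    outputs = [n for n, k in graph.kinds.items() if k == "output"]
    if not outputs:
        return 0.0
    reachable = graph.reachable_from_sources()
    return sum(n in reachable for n in outputs) / len(outputs)


def content_score(graph: WorkflowGraph) -> float:
    """Assumed metric: fraction of nodes flagged as sufficiently described."""
    if not graph.kinds:
        return 0.0
    return sum(graph.described.values()) / len(graph.kinds)


def build_example() -> WorkflowGraph:
    """Toy paper: one result is traceable to data, one figure is not."""
    g = WorkflowGraph()
    g.add_node("dataset", "source")
    g.add_node("preprocessing", "method")
    g.add_node("training_run", "experiment", described=False)
    g.add_node("accuracy_table", "output")
    g.add_node("ablation_figure", "output")  # no incoming edge in the paper
    g.add_edge("dataset", "preprocessing")
    g.add_edge("preprocessing", "training_run")
    g.add_edge("training_run", "accuracy_table")
    return g


if __name__ == "__main__":
    example = build_example()
    print("structural:", structural_score(example))
    print("content:", content_score(example))
```

In this toy graph, one of the two outputs cannot be traced back to any source and one of the five steps lacks detail, so both scores fall below 1. How ARA actually defines, weights, and combines its scores is specified in the paper and repository, not here.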

Problem

Research questions and friction points this paper is trying to address.

reproducibility

peer review

scientific evaluation

workflow reconstruction

computational reproducibility

Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic Reproducibility Assessment

workflow graph extraction

computational reproducibility