🤖 AI Summary
The scale and complexity of modern scientific research make traditional peer review inadequate for evaluating reproducibility. This work frames reproducibility assessment as a structured reasoning task over scientific literature. It introduces an agent-based mechanism that uses large language models to extract structured information from a paper, construct a directed workflow graph, and combine structural and content-based scores into an automated evaluation. The method reconstructs computational workflows consistently across scientific domains and LLMs, reaching accuracies of about 61%, 60.71%, and 61.68% on the ReScience C, ReproBench, and GoldStandardDB benchmarks, respectively. On ReproBench and GoldStandardDB it exceeds the compared results of 36.84% and 43.56%.
📝 Abstract
Scientific peer review increasingly struggles to assess reproducibility at the scale and complexity of modern research output. Evaluating reproducibility requires reconstructing experimental dependencies, methodological choices, data flows, and result-generating procedures, which often exceeds what human reviewers can provide. Agentic Reproducibility Assessment (ARA) formalizes reproducibility assessment as a structured reasoning task over scientific documents. Given a paper, ARA extracts a directed workflow graph linking sources, methods, experiments, and outputs, then evaluates the graph's reconstructability using structural and content-based scores. Experiments on 213 ReScience C articles (the largest cross-domain benchmark of human-validated computational reproducibility studies considered to date) demonstrate that ARA generalizes and that its workflow reconstruction and assessment remain consistent across LLMs, model temperatures, and scientific domains. ARA achieves ~61% accuracy on three benchmarks, including the highest accuracy reported on ReproBench (60.71% vs. 36.84%) and GoldStandardDB (61.68% vs. 43.56%), highlighting its potential to complement human review at scale and enable next-generation peer review. Code and data are available at https://github.com/AndresLaverdeMarin/agentic_reproducibility_assessment.
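
To make the idea of a directed workflow graph with a structural score concrete, here is a minimal Python sketch. It is not taken from the ARA code base. The node kinds follow the abstract's wording (sources, methods, experiments, outputs), but the function names, the example graph, and the scoring rule (the fraction of outputs that can be traced back to at least one source) are all assumptions made for illustration.

```python
from collections import deque


def reachable_from(edges, starts):
    """Return every node reachable from `starts` by following directed edges."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)
    seen = set(starts)
    queue = deque(starts)
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def structural_score(nodes, edges):
    """Fraction of output nodes traceable back to at least one source node.

    Hypothetical metric for illustration only; not ARA's scoring function.
    """
    sources = [name for name, kind in nodes.items() if kind == "source"]
    outputs = [name for name, kind in nodes.items() if kind == "output"]
    if not outputs:
        return 0.0
    traced = reachable_from(edges, sources)
    return sum(1 for name in outputs if name in traced) / len(outputs)


# Hypothetical workflow: one output is documented end to end, one is not.
nodes = {
    "dataset": "source",
    "preprocess": "method",
    "train_model": "experiment",
    "accuracy_table": "output",
    "ablation_plot": "output",
}
edges = [
    ("dataset", "preprocess"),
    ("preprocess", "train_model"),
    ("train_model", "accuracy_table"),
]

print(structural_score(nodes, edges))  # 0.5
```

In this toy graph, `ablation_plot` has no path from any source, so only one of the two outputs is reconstructable and the score is 0.5. A content-based score, as mentioned in the abstract, would additionally need to inspect what each node says (for example, whether hyperparameters or data versions are stated), which this sketch does not attempt.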