ReproScore: Separating Readiness from Outcome in Research Software Reproducibility Assessment

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes ReproScore, a framework that separates reproducibility readiness (RRS) from reproducibility outcome (ROS). Existing tools often conflate the two by treating static repository completeness as a proxy for successful execution. RRS comprises 26 sub-metrics across five categories, while ROS is derived from sandboxed execution probes when sandbox infrastructure is available. The two are combined into a coverage-adaptive Composite Score (RCS), with metric weights set through versioned YAML profiles in a community rubric. An evaluation on 423 GitHub repositories spanning five failure modes shows that the environment category strongly discriminates failure mode, yet RRS has near-zero correlation with binary execution success. This "readiness-outcome gap" supports keeping the two scores architecturally separate.
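The summary does not state how the composite score is computed, so the following is only a minimal sketch of one way a coverage-dependent blend of RRS and ROS could work. Everything here is an assumption made for illustration: the `Profile` class (standing in for a parsed YAML profile), the `outcome_weight` parameter, the linear blending rule, and every category name other than "environment".

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Profile:
    """Hypothetical weighting profile, as if parsed from a versioned YAML file."""
    version: str
    category_weights: Dict[str, float]  # relative weight per RRS category
    outcome_weight: float               # assumed maximum share given to ROS


def readiness_score(category_scores: Dict[str, float], profile: Profile) -> float:
    """Weighted mean of per-category scores in [0, 1] (assumed aggregation)."""
    total = sum(profile.category_weights.get(c, 0.0) for c in category_scores)
    if total == 0:
        return 0.0
    weighted = sum(
        score * profile.category_weights.get(c, 0.0)
        for c, score in category_scores.items()
    )
    return weighted / total


def composite_score(
    rrs: float, ros: Optional[float], coverage: float, profile: Profile
) -> float:
    """Blend RRS and ROS; the ROS share shrinks as probe coverage drops.

    With no execution result (ros is None) or zero coverage, the composite
    falls back to RRS alone. This rule is an assumption, not the paper's formula.
    """
    if ros is None or coverage <= 0:
        return rrs
    w = profile.outcome_weight * min(coverage, 1.0)
    return (1.0 - w) * rrs + w * ros


# Illustrative profile; only "environment" is a category named in the source.
example_profile = Profile(
    version="0.1",
    category_weights={"environment": 2.0, "documentation": 1.0},
    outcome_weight=0.5,
)
```

Under this assumed rule, a repository with no sandbox results keeps its readiness score unchanged, while one with full probe coverage and a failed run is pulled down by up to `outcome_weight`.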

📝 Abstract

Digital libraries curate millions of research software artefacts yet lack scalable infrastructure for assessing whether those artefacts remain executable. Existing automated assessment tools treat static repository completeness -- what a repository contains -- as a proxy for execution success -- whether it runs. We term this the readiness-outcome conflation and present ReproScore, a two-tier framework that explicitly separates reproducibility readiness (RRS) from reproducibility outcome (ROS), combining them into a coverage-adaptive Composite Score (RCS). RRS comprises 26 sub-metrics across five categories; ROS provides execution-based probes when sandbox infrastructure is available; a community rubric externalises weighting priorities as versioned YAML profiles. Evaluated on 423 GitHub repositories from a large-scale ground-truth corpus spanning five failure modes, two complementary findings emerge: the environment category strongly discriminates failure mode, confirming static signals capture meaningful structural differences; yet RRS exhibits near-zero binary success correlation, empirically quantifying the readiness-outcome gap at repository scale. Together, these findings validate the architectural separation as both necessary and non-trivial, positioning ReproScore as scalable infrastructure for reproducibility-aware curation in digital library workflows.
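The abstract reports a near-zero correlation between RRS and binary execution success but does not name the statistic used. As an illustration only, the sketch below computes a point-biserial correlation (Pearson correlation with a 0/1 variable), one common choice for this kind of comparison. The function name and the toy inputs are hypothetical and are not data from the paper.

```python
import math
from typing import Sequence


def point_biserial(scores: Sequence[float], success: Sequence[int]) -> float:
    """Pearson correlation between a continuous score and a 0/1 outcome."""
    if len(scores) != len(success) or len(scores) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(scores)
    mean_x = sum(scores) / n
    mean_y = sum(success) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores, success))
    var_x = sum((x - mean_x) ** 2 for x in scores)
    var_y = sum((y - mean_y) ** 2 for y in success)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when either input is constant")
    return cov / math.sqrt(var_x * var_y)


# Toy inputs (not from the paper): readiness scores are spread identically
# across failing (0) and succeeding (1) repositories.
toy_rrs = [0.2, 0.8, 0.2, 0.8]
toy_success = [0, 0, 1, 1]
r = point_biserial(toy_rrs, toy_success)
```

In the toy case, high and low readiness scores appear equally often among successes and failures, so the coefficient is approximately zero even though the scores themselves vary widely. That is the pattern the abstract calls the readiness-outcome gap.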

Problem

Research questions and friction points this paper is trying to address.

reproducibility

readiness-outcome conflation

research software

executable assessment

digital libraries

Innovation

Methods, ideas, or system contributions that make the work stand out.

ReproScore

reproducibility readiness

reproducibility outcome