Containing the Reproducibility Gap: Automated Repository-Level Containerization for Scholarly Jupyter Notebooks

πŸ“… 2026-04-01
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the widespread irreproducibility of academic Jupyter notebooks caused by environment drift, missing dependencies, and implicit execution assumptions. The authors propose the first web-oriented, automated reproducibility engineering pipeline that systematically reconstructs and evaluates repository-level execution environments for notebooks hosted on GitHub. By leveraging dependency inference, auto-generated Docker containers, and isolated execution, the pipeline enables large-scale assessment of reproducibility. A novel four-category execution outcome framework is introduced to quantify reproduction fidelity. Evaluation on 443 real-world notebooks shows that containerization resolves 66.7% of dependency-related failures; however, only 46.3% achieve high output fidelity, demonstrating that while containerization is necessary, it is insufficient for bit-for-bit reproducibility. These findings underscore the critical need for systematic reproducibility evaluation in computational research.
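
The paper's code is not reproduced on this page. As a rough illustration of the dependency-inference and container-generation steps the summary describes, here is a minimal Python sketch; the function names, base image, and single-notebook entry point are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: infer imports from a notebook's code cells, then emit
# a Dockerfile that rebuilds an isolated execution environment for it.
import ast
import json
from pathlib import Path

def infer_imports(notebook_path: Path) -> set[str]:
    """Collect top-level module names imported in a notebook's code cells."""
    nb = json.loads(notebook_path.read_text(encoding="utf-8"))
    modules: set[str] = set()
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = "".join(cell.get("source", []))
        try:
            tree = ast.parse(source)
        except SyntaxError:
            continue  # cell magics and shell escapes are not valid Python
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                modules.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                modules.add(node.module.split(".")[0])
    return modules

def generate_dockerfile(packages: set[str], python_version: str = "3.11") -> str:
    """Emit a Dockerfile that pins the interpreter and installs inferred packages.

    Simplification: treats each import name as its PyPI name, which is untrue
    for cases like sklearn -> scikit-learn; a real pipeline needs a mapping.
    """
    deps = " ".join(sorted(packages | {"jupyter", "nbconvert"}))
    return "\n".join([
        f"FROM python:{python_version}-slim",
        "WORKDIR /repo",
        "COPY . /repo",
        f"RUN pip install --no-cache-dir {deps}",
        # Headless re-execution of the notebook inside the container.
        'CMD ["jupyter", "nbconvert", "--to", "notebook", "--execute", "notebook.ipynb"]',
    ])
```

A repository-level pipeline would additionally merge inferred imports with any declared manifests (requirements.txt, environment.yml) and pin the versions it recovers, rather than installing latest releases as this sketch does.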
πŸ“ Abstract
Computational reproducibility is fundamental to trustworthy science, yet remains difficult to achieve in practice across various research workflows, including Jupyter notebooks published alongside scholarly articles. Environment drift, undocumented dependencies, and implicit execution assumptions frequently prevent independent re-execution of published research. Despite existing reproducibility guidelines, scalable and systematic infrastructure for automated assessment remains limited. We present an automated, web-oriented reproducibility engineering pipeline that reconstructs and evaluates repository-level execution environments for scholarly notebooks. The system performs dependency inference, automated container generation, and isolated execution to approximate the notebook's original computational context. We evaluate the approach on 443 notebooks from 116 GitHub repositories referenced by publications in PubMed Central. Execution outcomes are classified into four categories: resolved environment failures, persistent logic or data errors, reproducibility drift, and container-induced regressions. Our results show that containerization resolves 66.7% of prior dependency-related failures and substantially improves execution robustness. However, a significant reproducibility gap remains: 53.7% of notebooks exhibit low output fidelity, largely due to persistent runtime failures and stochastic non-determinism. These findings indicate that standardized containerization is essential for computational stability but insufficient for full bit-wise reproducibility. The framework offers a scalable solution for researchers, editors, and archivists seeking systematic, automated assessment of computational artifacts.
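
The four-category framework lends itself to a simple decision rule. The sketch below is an assumed reading of the categories listed in the abstract: the Outcome names mirror the paper's terms, but the branching logic is our illustration, not the authors' classifier.

```python
# Hypothetical sketch: classify a notebook by comparing its bare-metal run
# with its containerized re-execution and the fidelity of the outputs.
from enum import Enum
from typing import Optional

class Outcome(Enum):
    RESOLVED_ENV_FAILURE = "resolved environment failure"
    PERSISTENT_ERROR = "persistent logic or data error"
    REPRODUCIBILITY_DRIFT = "reproducibility drift"
    CONTAINER_REGRESSION = "container-induced regression"

def classify(bare_ok: bool, container_ok: bool, outputs_match: bool) -> Optional[Outcome]:
    """Map (bare run, containerized run, output comparison) to a category."""
    if not container_ok:
        # Fails inside the container: either the container broke a previously
        # working run, or the failure is independent of the environment.
        return Outcome.CONTAINER_REGRESSION if bare_ok else Outcome.PERSISTENT_ERROR
    if not bare_ok:
        return Outcome.RESOLVED_ENV_FAILURE   # container fixed an environment failure
    if not outputs_match:
        return Outcome.REPRODUCIBILITY_DRIFT  # executes, but outputs diverge
    return None  # runs both ways with matching outputs: faithful reproduction
```

Under this reading, the reported 66.7% corresponds to previously failing notebooks that land in the RESOLVED_ENV_FAILURE branch, while the 53.7% with low output fidelity would fall into the PERSISTENT_ERROR and REPRODUCIBILITY_DRIFT branches.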
Problem

Research questions and friction points this paper is trying to address.

computational reproducibility
Jupyter notebooks
environment drift
dependency management
reproducibility gap
Innovation

Methods, ideas, or system contributions that make the work stand out.

automated containerization
computational reproducibility
Jupyter notebooks
dependency inference
repository-level execution
πŸ”Ž Similar Papers
No similar papers found.
Sheeba Samuel
Distributed and Self-organizing Systems, Chemnitz University of Technology, Chemnitz, Germany
Daniel Mietchen
FIZ Karlsruhe β€” Leibniz Institute for Information Infrastructure
Web-based collaboration, open science, FAIR data, Wikidata, sustainable science
Hemanta Lo
Distributed and Self-organizing Systems, Chemnitz University of Technology, Chemnitz, Germany
Martin Gaedke
Technische UniversitΓ€t Chemnitz
Web Engineering, Service Engineering, Smart Data, Web of Things, Intent-oriented Systems