Containing the Reproducibility Gap: Automated Repository-Level Containerization for Scholarly Jupyter Notebooks

πŸ“… 2026-04-01
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the widespread irreproducibility of academic Jupyter notebooks caused by environment drift, missing dependencies, and implicit execution assumptions. The authors propose the first web-oriented, automated reproducibility engineering pipeline that systematically reconstructs and evaluates repository-level execution environments for notebooks hosted on GitHub. By leveraging dependency inference, auto-generated Docker containers, and isolated execution, the pipeline enables large-scale assessment of reproducibility. A novel four-category execution outcome framework is introduced to quantify reproduction fidelity. Evaluation on 443 real-world notebooks shows that containerization resolves 66.7% of dependency-related failures; however, only 46.3% achieve high output fidelity, demonstrating that while containerization is necessary, it is insufficient for bit-for-bit reproducibility. These findings underscore the critical need for systematic reproducibility evaluation in computational research.
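
The paper's code is not reproduced on this page. As a rough illustration of the dependency-inference and container-generation steps the summary describes, here is a minimal Python sketch; the function names, base image, and single-notebook entry point are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: infer imports from a notebook's code cells, then emit
# a Dockerfile that rebuilds an isolated execution environment for it.
import ast
import json
from pathlib import Path

def infer_imports(notebook_path: Path) -> set[str]:
    """Collect top-level module names imported in a notebook's code cells."""
    nb = json.loads(notebook_path.read_text(encoding="utf-8"))
    modules: set[str] = set()
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = "".join(cell.get("source", []))
        try:
            tree = ast.parse(source)
        except SyntaxError:
            continue  # cell magics and shell escapes are not valid Python
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                modules.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                modules.add(node.module.split(".")[0])
    return modules

def generate_dockerfile(packages: set[str], python_version: str = "3.11") -> str:
    """Emit a Dockerfile that pins the interpreter and installs inferred packages.

    Simplification: treats each import name as its PyPI name, which is untrue
    for cases like sklearn -> scikit-learn; a real pipeline needs a mapping.
    """
    deps = " ".join(sorted(packages | {"jupyter", "nbconvert"}))
    return "\n".join([
        f"FROM python:{python_version}-slim",
        "WORKDIR /repo",
        "COPY . /repo",
        f"RUN pip install --no-cache-dir {deps}",
        # Headless re-execution of the notebook inside the container.
        'CMD ["jupyter", "nbconvert", "--to", "notebook", "--execute", "notebook.ipynb"]',
    ])
```

A repository-level pipeline would additionally merge inferred imports with any declared manifests (requirements.txt, environment.yml) and pin the versions it recovers, rather than installing latest releases as this sketch does.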
πŸ“ Abstract
Computational reproducibility is fundamental to trustworthy science, yet remains difficult to achieve in practice across various research workflows, including Jupyter notebooks published alongside scholarly articles. Environment drift, undocumented dependencies, and implicit execution assumptions frequently prevent independent re-execution of published research. Despite existing reproducibility guidelines, scalable and systematic infrastructure for automated assessment remains limited. We present an automated, web-oriented reproducibility engineering pipeline that reconstructs and evaluates repository-level execution environments for scholarly notebooks. The system performs dependency inference, automated container generation, and isolated execution to approximate the notebook's original computational context. We evaluate the approach on 443 notebooks from 116 GitHub repositories referenced by publications in PubMed Central. Execution outcomes are classified into four categories: resolved environment failures, persistent logic or data errors, reproducibility drift, and container-induced regressions. Our results show that containerization resolves 66.7% of prior dependency-related failures and substantially improves execution robustness. However, a significant reproducibility gap remains: 53.7% of notebooks exhibit low output fidelity, largely due to persistent runtime failures and stochastic non-determinism. These findings indicate that standardized containerization is essential for computational stability but insufficient for full bit-wise reproducibility. The framework offers a scalable solution for researchers, editors, and archivists seeking systematic, automated assessment of computational artifacts.
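
The four-category framework lends itself to a simple decision rule. The sketch below is an assumed reading of the categories listed in the abstract: the Outcome names mirror the paper's terms, but the branching logic is our illustration, not the authors' classifier.

```python
# Hypothetical sketch: classify a notebook by comparing its bare-metal run
# with its containerized re-execution and the fidelity of the outputs.
from enum import Enum
from typing import Optional

class Outcome(Enum):
    RESOLVED_ENV_FAILURE = "resolved environment failure"
    PERSISTENT_ERROR = "persistent logic or data error"
    REPRODUCIBILITY_DRIFT = "reproducibility drift"
    CONTAINER_REGRESSION = "container-induced regression"

def classify(bare_ok: bool, container_ok: bool, outputs_match: bool) -> Optional[Outcome]:
    """Map (bare run, containerized run, output comparison) to a category."""
    if not container_ok:
        # Fails inside the container: either the container broke a previously
        # working run, or the failure is independent of the environment.
        return Outcome.CONTAINER_REGRESSION if bare_ok else Outcome.PERSISTENT_ERROR
    if not bare_ok:
        return Outcome.RESOLVED_ENV_FAILURE   # container fixed an environment failure
    if not outputs_match:
        return Outcome.REPRODUCIBILITY_DRIFT  # executes, but outputs diverge
    return None  # runs both ways with matching outputs: faithful reproduction
```

Under this reading, the reported 66.7% corresponds to previously failing notebooks that land in the RESOLVED_ENV_FAILURE branch, while the 53.7% with low output fidelity would fall into the PERSISTENT_ERROR and REPRODUCIBILITY_DRIFT branches.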
Problem

Research questions and friction points this paper is trying to address.

computational reproducibility
Jupyter notebooks
environment drift
dependency management
reproducibility gap
Innovation

Methods, ideas, or system contributions that make the work stand out.

automated containerization
computational reproducibility
Jupyter notebooks
dependency inference
repository-level execution
πŸ”Ž Similar Papers
No similar papers found.
Sheeba Samuel
Distributed and Self-organizing Systems, Chemnitz University of Technology, Chemnitz, Germany
Daniel Mietchen
FIZ Karlsruhe β€” Leibniz Institute for Information Infrastructure
Web-based collaboration, open science, FAIR data, Wikidata, sustainable science
Hemanta Lo
Distributed and Self-organizing Systems, Chemnitz University of Technology, Chemnitz, Germany
Martin Gaedke
Technische UniversitΓ€t Chemnitz
Web Engineering, Service Engineering, Smart Data, Web of Things, Intent-oriented Systems