Efficiently Reproducing Distributed Workflows in Notebook-based Systems

📅 2026-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the inefficiency of notebook-based distributed workflows, where even small code or parameter changes often force full end-to-end re-execution, hindering iterative development and reproducibility. The authors propose NBRewind, a notebook kernel system for efficient, reproducible execution of distributed workflows. NBRewind consists of two kernels, audit and repeat: the audit kernel takes incremental, cell-level checkpoints, and the repeat kernel reconstructs those checkpoints to enable partial re-execution. Both rely on data-flow analysis across cells. Checkpoints and logs are packaged as part of a standardized notebook specification to ease sharing. Real-world case studies on high-performance computing (HPC) systems show that incremental checkpointing adds minimal overhead and enables portable, cross-site reproducibility.
📝 Abstract
Notebooks provide an author-friendly environment for iterative development, modular execution, and easy sharing. Distributed workflows are increasingly being authored and executed in notebooks, yet sharing and reproducing them remains challenging. Even small code or parameter changes often force full end-to-end re-execution of the distributed workflow, limiting iterative development for such workloads. Current methods for improving notebook execution operate on single-node workflows, while optimization techniques for distributed workflows typically sacrifice reproducibility. We introduce NBRewind, a notebook kernel system for efficient, reproducible execution of distributed workflows in notebooks. NBRewind consists of two kernels: audit and repeat. The audit kernel performs incremental, cell-level checkpointing to avoid unnecessary re-runs; the repeat kernel reconstructs checkpoints and enables partial re-execution, including of notebook cells that manage the distributed workflow. Both kernel methods are based on data-flow analysis across cells. We show how checkpoints and logs, when packaged as part of a standardized notebook specification, improve sharing and reproducibility. Using real-world case studies, we show that creating incremental checkpoints adds minimal overhead and enables portable, cross-site reproducibility of notebook-based distributed workflows on HPC systems.
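The abstract's idea of data-flow analysis across cells can be illustrated with a small sketch. This is not NBRewind's implementation: the function names `defs_and_uses` and `cells_to_rerun` are hypothetical, the analysis is purely static and based on Python's standard `ast` module, and it assumes cells run top to bottom. It also ignores attribute mutation, files, and any distributed state, all of which a real system would have to track.

```python
import ast


def defs_and_uses(source):
    """Return (names bound, names read) anywhere in a cell's source.

    Conservative by design: a name read after being bound in the same
    cell still counts as a read.
    """
    defined, used = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)
            else:
                used.add(node.id)
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, ast.alias):
            # `import a.b` binds `a`; `import a as b` binds `b`.
            defined.add((node.asname or node.name).split(".")[0])
    return defined, used


def cells_to_rerun(cells, changed):
    """Indices of cells to re-execute after cell `changed` is edited.

    A later cell is stale if it reads any name bound by a stale cell;
    every other cell could, in principle, be restored from a checkpoint.
    """
    stale = {changed}
    dirty_names = set(defs_and_uses(cells[changed])[0])
    for i in range(changed + 1, len(cells)):
        defined, used = defs_and_uses(cells[i])
        if used & dirty_names:
            stale.add(i)
            dirty_names |= defined
    return sorted(stale)


# Hypothetical five-cell notebook, one string per cell.
cells = [
    "import math\nn = 4",
    "data = list(range(n))",
    "scale = 10",
    "result = [scale * math.sqrt(v) for v in data]",
    "print(scale)",
]

print(cells_to_rerun(cells, 2))  # cells that depend on `scale`
print(cells_to_rerun(cells, 0))  # cells that depend on `math` or `n`
```

Under these assumptions, editing the cell that sets `scale` leaves the two cells before it untouched, which is the kind of saving that cell-level checkpoints are meant to deliver when the skipped cells are expensive distributed steps.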
Problem

Research questions and friction points this paper is trying to address.

distributed workflows
notebook reproducibility
iterative development
checkpointing
re-execution
Innovation

Methods, ideas, or system contributions that make the work stand out.

notebook-based workflows
distributed computing
incremental checkpointing
reproducibility
data-flow analysis
Talha Azaz
University of Missouri, Columbia, MO, USA
Raza Ahmad
DePaul University, Chicago, IL, USA
Md Saiful Islam
University of Notre Dame, Notre Dame, IN, USA
Douglas Thain
Professor, University of Notre Dame
Distributed systems, clouds, workflows, filesystems, scientific computing
Tanu Malik
Associate Professor, University of Missouri, Columbia
Data Management Systems, Data Provenance, HPC systems