🤖 AI Summary
This work addresses an inefficiency in notebook-based distributed workflows: minor code or parameter changes often trigger full end-to-end re-execution, which hinders iterative development and makes the workflows hard to share and reproduce. The authors propose NBRewind, a notebook kernel system that enables incremental, cell-level re-execution while preserving reproducibility. NBRewind consists of two kernels, audit and repeat, both based on data-flow analysis across cells: the audit kernel takes incremental, cell-level checkpoints, and the repeat kernel reconstructs them to support partial re-execution. Checkpoints and logs are packaged as part of a standardized notebook specification so that they can be shared. Real-world case studies on high-performance computing (HPC) systems show that incremental checkpointing adds minimal overhead and enables portable, cross-site reproducibility.
📝 Abstract
Notebooks provide an author-friendly environment for iterative development, modular execution, and easy sharing. Distributed workflows are increasingly being authored and executed in notebooks, yet sharing and reproducing them remains challenging. Even small code or parameter changes often force full end-to-end re-execution of the distributed workflow, limiting iterative development for such workloads. Current methods for improving notebook execution operate on single-node workflows, while optimization techniques for distributed workflows typically sacrifice reproducibility. We introduce NBRewind, a notebook kernel system for efficient, reproducible execution of distributed workflows in notebooks. NBRewind consists of two kernels: audit and repeat. The audit kernel performs incremental, cell-level checkpointing to avoid unnecessary re-runs; the repeat kernel reconstructs checkpoints and enables partial re-execution, including of notebook cells that manage distributed workflows. Both kernels are based on data-flow analysis across cells. We show how checkpoints and logs, when packaged as part of a standardized notebook specification, improve sharing and reproducibility. Using real-world case studies, we show that creating incremental checkpoints adds minimal overhead and enables portable, cross-site reproducibility of notebook-based distributed workflows on HPC systems.
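To make the idea of data-flow analysis across cells concrete, here is a minimal Python sketch of how one might decide which cells need re-execution after an edit. It is not NBRewind's implementation: the function names `cell_defs_uses` and `cells_to_rerun` are hypothetical, the analysis only tracks top-level variable names via the standard `ast` module, and it ignores side effects such as files, mutation through aliases, and distributed state.

```python
import ast


def cell_defs_uses(source):
    """Return (names a cell binds, names a cell reads). Coarse, name-based."""
    defs, uses = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defs.add(node.id)
            else:
                uses.add(node.id)
        elif isinstance(node, ast.AugAssign) and isinstance(node.target, ast.Name):
            # `x += 1` reads x as well as writing it.
            uses.add(node.target.id)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            defs.add(node.name)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                defs.add(alias.asname or alias.name.split(".")[0])
    return defs, uses


def cells_to_rerun(cells, changed):
    """Indices of cells to re-execute after editing cells[changed].

    Assumption: every other cell's result could be restored from a checkpoint.
    """
    stale = set()  # names whose checkpointed values may be out of date
    rerun = []
    for i, source in enumerate(cells):
        defs, uses = cell_defs_uses(source)
        if i == changed or (i > changed and uses & stale):
            rerun.append(i)
            stale |= defs
        elif i > changed:
            # A clean cell rebinds these names, so they are fresh again.
            stale -= defs
    return rerun


cells = [
    "n = 4",
    "data = list(range(n))",
    "label = 'run-1'",
    "total = sum(data)",
    "print(label, total)",
]
print(cells_to_rerun(cells, changed=1))  # [1, 3, 4]
```

In this toy notebook, editing cell 1 leaves cells 0 and 2 untouched because neither reads a name that cell 1 produces, while cells 3 and 4 are re-run because they depend on `data`, directly or through `total`.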