Checkmate: Zero-Overhead Model Checkpointing via Network Gradient Replication

📅 2025-07-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
In traditional deep neural network training, frequent checkpointing severely impedes data-parallel training throughput and introduces a fundamental trade-off between checkpoint frequency and fault-tolerance overhead. This paper proposes Checkmate, a zero-overhead, per-iteration checkpointing mechanism. Its core innovation leverages the gradient synchronization traffic already present in data-parallel training: via a lightweight multicast abstraction, gradients are forwarded in real time to a CPU-based shadow cluster, which reconstructs the model state online and eliminates GPU-side checkpoint copies entirely. Checkmate requires no modifications to training logic and sacrifices no throughput. It achieves 5-34.5x higher checkpoint frequency, reduces post-failure recomputation by 80%-97.1%, and delivers 1.3-6.5x higher throughput than state-of-the-art systems at equivalent checkpoint frequencies, fully decoupling checkpoint frequency from performance cost.

📝 Abstract
This paper presents Checkmate, a system that enables per-iteration checkpointing in DNN training without any training slowdown. The traditional approach to checkpointing requires a pause in training to copy model states to a separate location, allowing the state to be restored in the event of failure. This approach fundamentally has a tradeoff between the frequency of checkpoints and the cost of a failure. We avoid this tradeoff; our key insight is that in data-parallel training, all information necessary to create a checkpoint already exists in the network as gradients. Our core contribution is a new multicast abstraction that simultaneously delivers gradients to a separate CPU-based shadow cluster. The shadow maintains a checkpoint by applying those gradients to a copy of the model. Our evaluation shows that Checkmate performs per-iteration checkpointing with training throughput comparable to an ideal no-checkpoint baseline. Checkmate achieves 5 to 34.5x more frequent checkpointing compared to state-of-the-art checkpointing systems, resulting in 80% to 97.1% reduction in repeated work per failure. At the same checkpointing frequency, Checkmate delivers 1.3x to 6.5x throughput compared to other systems.
Problem

Research questions and friction points this paper is trying to address.

Checkpointing pauses training to copy model state, slowing throughput
Checkpoint frequency trades off directly against failure-recovery cost
Infrequent checkpoints force large amounts of repeated work per failure
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses network gradients for zero-overhead checkpointing
Multicast abstraction for simultaneous gradient delivery
Shadow cluster maintains checkpoint via gradient application
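The shadow-cluster idea above can be illustrated with a minimal sketch: the same averaged gradient that the data-parallel workers apply is also delivered to a CPU-side replica, which applies the identical optimizer step to a copy of the model. All names here (`multicast`, `ShadowWorker`) are illustrative assumptions, not the paper's actual API, and plain SGD stands in for the real optimizer.

```python
# Hypothetical sketch of Checkmate's shadow-checkpoint mechanism.
# Assumption: gradient all-reduce can forward the same averaged
# gradient to an extra CPU receiver at no additional GPU cost.

def multicast(local_grads, shadow):
    """Average gradients across workers (the usual all-reduce) and
    simultaneously deliver the result to the shadow replica."""
    avg = [sum(g) / len(g) for g in zip(*local_grads)]
    shadow.receive(avg)  # extra receiver on the existing traffic
    return avg

class ShadowWorker:
    """CPU-side replica that maintains a live checkpoint by applying
    the same updates the GPUs apply -- no GPU-side state copy."""
    def __init__(self, init_params, lr):
        self.params = list(init_params)  # copy of the initial model
        self.lr = lr

    def receive(self, grad):
        # Apply the same optimizer step (plain SGD here) to stay in sync.
        self.params = [p - self.lr * g for p, g in zip(self.params, grad)]

# --- toy training loop with two data-parallel workers ---------------
lr = 0.1
params = [1.0, -2.0]                  # replicated model parameters
shadow = ShadowWorker(params, lr)

for step in range(3):
    # Each worker computes a toy local gradient on its data shard.
    local_grads = [[p + w for p in params] for w in (0.0, 0.2)]
    grad = multicast(local_grads, shadow)
    params = [p - lr * g for p, g in zip(params, grad)]

# The shadow's checkpoint tracks the live model every iteration.
assert all(abs(a - b) < 1e-12 for a, b in zip(params, shadow.params))
```

Because the shadow consumes gradients that are already on the wire, the checkpoint advances once per iteration without pausing training or copying state off the GPUs.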
👥 Authors
Ankit Bhardwaj (Massachusetts Institute of Technology)
Weiyang Wang (MIT CSAIL; computer systems and networking)
Jeremy Carin (Massachusetts Institute of Technology)
Adam Belay (Associate Professor, MIT CSAIL; computer systems)
Manya Ghobadi (MIT; computer networks)