🤖 AI Summary
In traditional deep neural network training, frequent checkpointing slows data-parallel training, creating a fundamental trade-off between checkpoint frequency and fault-tolerance overhead. This paper proposes Checkmate, a per-iteration checkpointing mechanism that imposes no training slowdown. Its core insight is that the gradients already exchanged during data-parallel synchronization contain all the information needed for a checkpoint: a lightweight multicast abstraction forwards those gradients in real time to a CPU-based shadow cluster, which keeps a checkpoint current by applying them to a copy of the model. Checkmate requires no modifications to training logic and sacrifices no throughput. It achieves 5–34.5× more frequent checkpointing than state-of-the-art systems, reduces repeated work per failure by 80–97.1%, and delivers 1.3–6.5× the throughput of other systems at the same checkpoint frequency, effectively decoupling checkpoint frequency from performance cost.
📝 Abstract
This paper presents Checkmate, a system that enables per-iteration checkpointing in DNN training without any training slowdown. The traditional approach to checkpointing pauses training to copy model state to a separate location so that the state can be restored in the event of failure. This approach imposes a fundamental tradeoff between the frequency of checkpoints and the cost of a failure. We avoid this tradeoff; our key insight is that in data-parallel training, all information necessary to create a checkpoint already exists in the network as gradients. Our core contribution is a new multicast abstraction that simultaneously delivers gradients to a separate CPU-based shadow cluster. The shadow maintains a checkpoint by applying those gradients to a copy of the model. Our evaluation shows that Checkmate performs per-iteration checkpointing with training throughput comparable to an ideal no-checkpoint baseline. Checkmate achieves 5 to 34.5x more frequent checkpointing than state-of-the-art checkpointing systems, yielding an 80% to 97.1% reduction in repeated work per failure. At the same checkpointing frequency, Checkmate delivers 1.3x to 6.5x higher throughput than other systems.
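To make the shadow-cluster idea concrete, here is a minimal, hypothetical sketch of how a CPU-side shadow could keep a checkpoint current by replaying the gradients it receives. The names (`shadow_step`, `lr`) and the plain-SGD update are illustrative assumptions, not Checkmate's actual API or optimizer; the real system delivers gradients via its multicast abstraction rather than an in-process loop.

```python
# Hypothetical sketch: a CPU-side "shadow" maintains a checkpoint by
# applying the same gradient updates the GPU trainers exchange each
# iteration. Plain SGD is assumed for simplicity.

def shadow_step(shadow_params, gradients, lr=0.1):
    """Apply one iteration's averaged gradients to the shadow's model
    copy, mirroring the optimizer step taken by the trainers."""
    return [p - lr * g for p, g in zip(shadow_params, gradients)]

# The shadow starts from the same initial weights as the trainers...
params = [1.0, -2.0, 0.5]

# ...and consumes each iteration's gradients as they arrive over the
# multicast stream (gradient values here are fabricated for illustration).
for grads in [[0.2, -0.1, 0.0], [0.1, 0.05, -0.3]]:
    params = shadow_step(params, grads)

# After every iteration, `params` is an up-to-date checkpoint that a
# recovering trainer can restore from, with no GPU-side pause.
print(params)
```

Because the shadow only replays updates the network already carries, the trainers never stop to serialize state, which is why checkpoint frequency decouples from training throughput.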