LowDiff: Efficient Frequent Checkpointing via Low-Cost Differential for High-Performance Distributed Training Systems

📅 2025-09-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the storage overhead and performance degradation caused by frequent checkpointing in large-scale distributed deep learning, this paper proposes LowDiff, a general-purpose differential checkpointing framework. LowDiff extends differential checkpointing, previously limited to recommendation systems, to generic distributed training. It reuses compressed gradients at layer granularity as differential checkpoints, batches the compressed gradient writes for efficient persistence, and dynamically tunes both the checkpoint frequency and the batching size. A layer-wise gradient reuse and snapshotting approach combined with CPU-based asynchronous persistence further enables frequent checkpointing without gradient compression. Across diverse representative workloads, LowDiff sustains per-iteration checkpointing with less than 3.1% runtime overhead, offering low-overhead, high-frequency, and broadly applicable fault tolerance for distributed training.

📝 Abstract
Distributed training of large deep-learning models often leads to failures, so checkpointing is commonly employed for recovery. State-of-the-art studies focus on frequent checkpointing for fast recovery from failures. However, it generates numerous checkpoints, incurring substantial costs and thus degrading training performance. Recently, differential checkpointing has been proposed to reduce costs, but it is limited to recommendation systems, so its application to general distributed training systems remains unexplored. This paper proposes LowDiff, an efficient frequent checkpointing framework that reuses compressed gradients, serving as differential checkpoints to reduce cost. Furthermore, LowDiff incorporates a batched gradient write optimization to persist these differentials to storage efficiently. It also dynamically tunes both the checkpoint frequency and the batching size to maximize performance. We further enhance LowDiff with a layer-wise gradient reusing and snapshotting approach and a CPU-based asynchronous persistence strategy, enabling frequent checkpointing without gradient compression. Experiments on various workloads show that LowDiff can achieve a checkpointing frequency of up to once per iteration with less than 3.1% runtime overhead.
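The CPU-based asynchronous persistence strategy mentioned in the abstract can be illustrated with a small sketch. This is an assumed design, not the paper's code: the training loop pays only for an in-memory snapshot, while a background thread performs the slow storage write. The `async_persister` helper and `write_fn` callback are hypothetical names introduced here for illustration.

```python
import copy
import queue
import threading

def async_persister(write_fn):
    """Sketch of CPU-side asynchronous checkpoint persistence (assumed
    design): snapshots are queued in memory and a background thread
    performs the slow write, so training is not blocked on storage."""
    q = queue.Queue()

    def worker():
        while True:
            item = q.get()
            if item is None:        # sentinel: shut down the worker
                break
            write_fn(item)          # slow persistence happens off the critical path
            q.task_done()

    t = threading.Thread(target=worker, daemon=True)
    t.start()

    def snapshot(state):
        # Training blocks only for this in-memory copy, not for the write.
        q.put(copy.deepcopy(state))

    def close():
        q.put(None)
        t.join()

    return snapshot, close
```

Usage: `snapshot(model_state)` is called each iteration from the training loop, and `close()` drains the queue at shutdown; the deep copy decouples the persisted state from subsequent parameter updates.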
Problem

Research questions and friction points this paper is trying to address.

Reducing checkpointing costs in distributed training systems
Enabling frequent checkpointing without performance degradation
Applying differential checkpointing beyond recommendation systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reuses compressed gradients as differential checkpoints
Employs batched gradient write optimization for storage
Dynamically tunes checkpoint frequency and batching size
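The core idea behind these contributions can be sketched in a few lines. This is a minimal illustration of differential checkpointing under assumed simplifications (plain SGD, a dict of parameters, no compression); the `DiffCheckpointer` class and its methods are hypothetical names, not the paper's API. A full base checkpoint is written periodically, and intermediate iterations persist only the gradient update already computed by training, which recovery replays on top of the base.

```python
import copy

class DiffCheckpointer:
    """Minimal sketch of differential checkpointing (assumed design):
    a full checkpoint every `base_interval` steps, plus per-step
    gradient differentials that reuse the training-computed gradients."""

    def __init__(self, base_interval=3):
        self.base_interval = base_interval
        self.base = None    # last full checkpoint
        self.diffs = []     # gradient differentials since the base

    def step(self, step_idx, params, grads, lr):
        """Call after the optimizer applies `grads` to `params`."""
        if step_idx % self.base_interval == 0:
            # Full checkpoint: snapshot state and reset the differential log.
            self.base = copy.deepcopy(params)
            self.diffs = []
        else:
            # Differential checkpoint: log only the update applied this
            # step, reusing gradients instead of the whole model state.
            self.diffs.append({k: lr * g for k, g in grads.items()})

    def recover(self):
        # Replay the logged differentials on top of the base checkpoint.
        params = copy.deepcopy(self.base)
        for diff in self.diffs:
            for k, d in diff.items():
                params[k] -= d
        return params
```

In this sketch each differential is far smaller than a full checkpoint because it reuses data the training step already produced; LowDiff additionally compresses and batches these writes, which this illustration omits.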