🤖 AI Summary
To address frequent node failures, prolonged recovery times, and high checkpointing overhead in large-scale LLM training, this paper proposes an efficient fault-tolerant training system. Its core innovation is a near-zero-overhead, fine-grained asynchronous checkpointing mechanism that leverages idle network bandwidth: distributed state streaming, network-aware scheduling, and memory-optimized snapshotting overlap state persistence and restoration with GPU computation, eliminating rollbacks. Experiments show that, compared with conventional approaches, the system reduces recovery time by 98%, cuts GPU utilization loss by 68%, and preserves baseline training performance. It thus significantly improves training robustness and resource efficiency without compromising throughput or convergence behavior.
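The core idea of overlapping state persistence with computation can be sketched in a few lines. The snippet below is a minimal illustration, not FFTrainer's actual implementation: the `AsyncCheckpointer` class, its `send_fn` hook, and the toy training loop are all hypothetical stand-ins. On the critical path, `checkpoint()` only takes a fast in-memory copy (analogous to a device-to-host snapshot); a background thread then streams the shards out (analogous to using idle network bandwidth), so training proceeds while earlier checkpoints drain.

```python
import queue
import threading


class AsyncCheckpointer:
    """Hypothetical sketch of fine-grained asynchronous checkpointing:
    snapshot fast on the critical path, stream in the background."""

    def __init__(self, send_fn):
        self._send = send_fn  # stand-in for a network transfer over idle bandwidth
        self._q = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def checkpoint(self, step, state):
        # Critical path: only a cheap in-memory copy, so the training
        # step is barely delayed (the "near-zero-overhead" fast path).
        snapshot = {name: list(shard) for name, shard in state.items()}
        self._q.put((step, snapshot))  # hand off; training resumes immediately

    def _drain(self):
        # Background thread: persists shards while compute continues,
        # i.e. persistence is overlapped with the training loop.
        while True:
            step, snapshot = self._q.get()
            for name, shard in snapshot.items():
                self._send(step, name, shard)
            self._q.task_done()

    def flush(self):
        self._q.join()  # wait for all queued snapshots to be persisted


# Toy usage: the "remote store" is a local dict; training keeps
# updating the state after each checkpoint is handed off.
store = {}
ckpt = AsyncCheckpointer(
    lambda step, name, shard: store.setdefault(step, {}).__setitem__(name, shard)
)
state = {"layer0": [0.0, 0.0]}
for step in range(3):
    state["layer0"] = [w + 1.0 for w in state["layer0"]]  # simulated training step
    ckpt.checkpoint(step, state)
ckpt.flush()
# Because each checkpoint copied the state before handing it off, every
# persisted snapshot is consistent with its own step, untouched by later updates.
```

The copy-then-stream split is what makes recovery rollback-free in this sketch: every persisted snapshot is internally consistent, so a restarted worker can load the latest complete one instead of rolling back to an old synchronous checkpoint.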
📝 Abstract
Recent developments in large language models (LLMs) have introduced new requirements for efficient and robust training. As LLM clusters scale, node failures, lengthy recoveries, and bulky checkpoints erode efficiency. Infrequent asynchronous checkpoints trigger costly rollbacks, yet higher frequencies add prohibitive overhead. To address these challenges, we propose FFTrainer, a system designed for robust LLM training. FFTrainer leverages surplus network capacity to quickly save and load states, thereby preventing rollbacks and accelerating recovery. Compared with prior checkpointing approaches, FFTrainer reduces recovery time by up to 98% and mitigates GPU utilization loss by up to 68% without hindering normal training.