FFTrainer: Fast Failover in Large-Language Model Training with Almost-Free State Management

📅 2025-12-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address frequent node failures, prolonged recovery times, and high checkpointing overhead in large-scale LLM training, this paper proposes an efficient fault-tolerant training system. Its core innovation is a near-zero-overhead, fine-grained asynchronous checkpointing mechanism leveraging idle network bandwidth: it employs distributed state streaming, network-aware scheduling, and memory-optimized snapshotting to overlap state persistence and restoration with GPU computation—eliminating rollback. Experiments demonstrate that, compared to conventional approaches, the system reduces recovery time by 98%, decreases GPU utilization loss by 68%, and preserves baseline training performance. Consequently, it significantly enhances training robustness and resource efficiency without compromising throughput or convergence behavior.

Technology Category

Application Category

📝 Abstract
Recent developments in large language models (LLMs) have introduced new requirements for efficient and robust training. As LLM clusters scale, node failures, lengthy recoveries, and bulky checkpoints erode efficiency. Infrequent asynchronous checkpoints trigger costly rollbacks, yet higher frequencies add prohibitive overhead. To address these challenges, we propose FFTrainer, a system designed for robust LLM training. FFTrainer leverages surplus network capacity to quickly save and load states, thereby preventing rollbacks and accelerating recovery. Compared with prior checkpointing approaches, FFTrainer reduces recovery time by up to 98% and mitigates GPU utilization loss by up to 68% without hindering normal training.
Problem

Research questions and friction points this paper is trying to address.

Addresses node failures and slow recovery in large-scale LLM training.
Reduces overhead from frequent checkpoints that degrade training efficiency.
Minimizes GPU utilization loss and accelerates recovery after failures.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages surplus network capacity for state management
Prevents rollbacks and accelerates recovery process
Reduces recovery time and GPU utilization loss
🔎 Similar Papers
No similar papers found.
Bohan Zhao
Bohan Zhao
Scripps Research Institute/HHMI
NeuroscienceMemoryMetabolismSleep
Y
Yuanhong Wang
Tsinghua University
C
Chenglin Liu
Tsinghua University
J
Jiagi Pan
Tsinghua University
G
Guang Yang
Tsinghua University
R
Ruitao Liu
Tsinghua University
Tingrui Zhang
Tingrui Zhang
Zhejiang University
motion-planninggraphicsEmbodied-AI
K
Kai Luo
Tsinghua University
W
Wei Xu
Tsinghua University