FlashRecovery: Fast and Low-Cost Recovery from Failures for Large-Scale Training of LLMs

📅 2025-09-03
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address frequent hardware/software failures during large-scale training of large language models (LLMs), which cause costly interruptions and expensive recovery overhead, this paper proposes a highly scalable fault-tolerant training system. Methodologically, it abandons conventional periodic checkpointing in favor of proactive real-time failure detection, scale-invariant task restart, and checkpoint-free single-step recovery. The system integrates continuous state monitoring, a lightweight communication group reconstruction protocol, and differentiated recovery strategies for normal and faulty nodes to ensure rapid synchronization and responsiveness across thousands of devices. Evaluated on a cluster of 4,800 devices, the system achieves recovery times of about 150 seconds, with no significant degradation as scale increases. This design substantially improves training continuity and resource utilization in massive distributed LLM training.
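FlashRecovery's detection code is not included in this entry; the sketch below only illustrates, under stated assumptions, how heartbeat-style continuous training-state monitoring can flag a failed rank within seconds. The `FailureDetector` class, the timeout and scan intervals, and the `on_failure` callback are hypothetical names chosen for illustration, not the paper's implementation.

```python
# Hypothetical sketch of heartbeat-based failure detection (not the paper's code).
# Each training worker posts a timestamp after every step; a monitor thread flags
# any rank whose last heartbeat is older than a timeout on the order of seconds.
import threading
import time

HEARTBEAT_TIMEOUT_S = 5.0   # illustrative "within seconds" detection threshold
CHECK_INTERVAL_S = 1.0


class FailureDetector:
    def __init__(self, world_size: int):
        self.last_seen = {rank: time.monotonic() for rank in range(world_size)}
        self.lock = threading.Lock()
        self.failed = set()

    def heartbeat(self, rank: int) -> None:
        """Called by (or on behalf of) a worker after each training step."""
        with self.lock:
            self.last_seen[rank] = time.monotonic()

    def scan(self) -> set:
        """Return the set of ranks whose heartbeat has gone stale."""
        now = time.monotonic()
        with self.lock:
            for rank, ts in self.last_seen.items():
                if now - ts > HEARTBEAT_TIMEOUT_S:
                    self.failed.add(rank)
            return set(self.failed)


def monitor_loop(detector: FailureDetector, on_failure) -> None:
    """Poll the detector and hand any failed ranks to the recovery path."""
    while True:
        failed = detector.scan()
        if failed:
            on_failure(failed)
            return
        time.sleep(CHECK_INTERVAL_S)
```

A production system would additionally watch collective-communication errors and process exit codes, but the stale-heartbeat scan above captures the seconds-level detection idea.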

๐Ÿ“ Abstract
Large language models (LLMs) have made a profound impact across various fields due to their advanced capabilities. However, training these models at unprecedented scales requires extensive AI accelerator clusters and sophisticated parallelism strategies, which pose significant challenges in maintaining system reliability over prolonged training periods. A major concern is the substantial loss of training time caused by inevitable hardware and software failures. To address these challenges, we present FlashRecovery, a fast and low-cost failure recovery system comprising three core modules: (1) Active and real-time failure detection. This module performs continuous training state monitoring, enabling immediate identification of hardware and software failures within seconds, thus ensuring rapid incident response; (2) Scale-independent task restart. By employing different recovery strategies for normal and faulty nodes, combined with an optimized communication group reconstruction protocol, our approach ensures that the recovery time remains nearly constant, regardless of cluster scale; (3) Checkpoint-free recovery within one step. Our novel recovery mechanism enables single-step restoration, completely eliminating dependence on traditional checkpointing methods and their associated overhead. Collectively, these innovations enable FlashRecovery to achieve optimal Recovery Time Objective (RTO) and Recovery Point Objective (RPO), substantially improving the reliability and efficiency of long-duration LLM training. Experimental results demonstrate that FlashRecovery can restore training on a cluster with 4,800 devices within 150 seconds. We also verify that the time required for failure recovery remains nearly constant across different scales of training tasks.
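The abstract describes the recovery mechanism only at a high level; the following is a minimal sketch, assuming a PyTorch-style data-parallel setup, of the general idea behind combining communication-group reconstruction with checkpoint-free state transfer: after a restart, the replacement rank receives the latest parameters and optimizer tensors from a surviving replica instead of reloading a checkpoint. The function name `rebuild_and_recover`, the `src_rank` choice, and the use of `torch.distributed` are assumptions for illustration and are not taken from the paper.

```python
# Hedged illustration (not FlashRecovery's actual code): after a restart, every
# rank re-joins the process group, then the latest parameters/optimizer tensors
# are broadcast from a surviving data-parallel peer, so no checkpoint is read.
import torch
import torch.distributed as dist


def rebuild_and_recover(model, optimizer, src_rank: int) -> None:
    """Reconstruct communication groups and sync live training state.

    src_rank is assumed to be a healthy rank holding up-to-date state for this
    rank's data-parallel replica; in a real tensor/pipeline-parallel job the
    broadcast would be scoped to that sub-group rather than the world group.
    """
    # 1) Communication group reconstruction: re-initialize the default process
    #    group with the new world configuration (env:// rendezvous).
    if not dist.is_initialized():
        dist.init_process_group(backend="nccl", init_method="env://")

    # 2) Checkpoint-free state transfer: src_rank sends, all other ranks receive.
    for param in model.parameters():
        dist.broadcast(param.data, src=src_rank)
    for state in optimizer.state.values():
        for value in state.values():
            if torch.is_tensor(value):
                dist.broadcast(value, src=src_rank)

    # 3) All ranks resume from the same iteration; at most one step is redone.
```

Because the state transfer involves only a rank's own replica group, its cost is governed by model size rather than cluster size, which is one plausible reason recovery time can stay nearly constant as the job scales.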
Problem

Research questions and friction points this paper is trying to address.

Fast failure recovery for large-scale LLM training
Minimizing training time loss from hardware failures
Eliminating checkpoint overhead during system restoration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Active real-time failure detection within seconds
Scale-independent task restart with constant recovery time
Checkpoint-free single-step recovery eliminating traditional checkpointing overhead (a back-of-envelope cost comparison follows this list)
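To make the last point concrete, here is a standard back-of-envelope comparison (not taken from the paper) of work lost per failure under periodic checkpointing versus single-step recovery; the checkpoint interval, checkpoint-based restart time, and step time used in the example are assumed numbers, while the 150 s restart matches the figure reported in the abstract.

```python
# Back-of-envelope lost-time model per failure (illustrative assumptions only).

def lost_time_checkpointing(ckpt_interval_s: float, restart_s: float) -> float:
    # On average a failure lands mid-interval, so roughly half the interval of
    # computed work must be redone, plus the restart itself.
    return ckpt_interval_s / 2 + restart_s


def lost_time_single_step(step_time_s: float, restart_s: float) -> float:
    # Checkpoint-free recovery redoes at most one training step.
    return step_time_s + restart_s


# Assumed numbers: 30-min checkpoint interval, 10-min checkpoint-based restart,
# versus a 150 s FlashRecovery-style restart and a 30 s training step.
print(lost_time_checkpointing(30 * 60, 10 * 60))  # -> 1500.0 s lost per failure
print(lost_time_single_step(30, 150))             # -> 180.0 s lost per failure
```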
Authors

Haijun Zhang
Professor, IEEE Fellow, University of Science and Technology Beijing
6G, AI-enabled Wireless Communications, Resource Allocation, Mobility Management
Jinxiang Wang
iFLYTEK AI Engineering Institute, Hefei 230088, China
Zhenhua Yu
iFLYTEK AI Engineering Institute, Hefei 230088, China
Yanyong Zhang
University of Science and Technology of China; Rutgers University (Adjunct Visiting Professor)
Sensing, Cyber-Physical Systems, Multi-Modal Perception, Efficient AI Systems
Xuejie Ji
iFLYTEK AI Engineering Institute, Hefei 230088, China
Kaining Mao
iFLYTEK AI Engineering Institute, Hefei 230088, China
Jun Zhang
iFLYTEK AI Engineering Institute, Hefei 230088, China
Yaqing Zhang
iFLYTEK AI Engineering Institute, Hefei 230088, China
Ting Wu
iFLYTEK AI Engineering Institute, Hefei 230088, China
Fei Jie
iFLYTEK AI Engineering Institute, Hefei 230088, China
Xiemin Huang
iFLYTEK AI Engineering Institute, Hefei 230088, China
Zhifang Cai
Huawei Technologies Co., Ltd, Shenzhen 518129, China
Junhua Cheng
Huawei Technologies Co., Ltd, Shenzhen 518129, China
Shuwei Wang
Huawei Technologies Co., Ltd, Shenzhen 518129, China
Wei Li
Huawei Technologies Co., Ltd, Shenzhen 518129, China
Xiaoming Bao
Huawei Technologies Co., Ltd, Shenzhen 518129, China
Hua Xu
Huawei Technologies Co., Ltd, Shenzhen 518129, China
Shixiong Zhao
University of Hong Kong
Distributed Systems
Jun Li
Huawei Technologies Co., Ltd, Shenzhen 518129, China
Hongwei Sun
Huawei Technologies Co., Ltd, Shenzhen 518129, China
Ziyang Zhang
Huawei Technologies Co., Ltd, Shenzhen 518129, China
Yi Xiong
Huawei Technologies Co., Ltd, Shenzhen 518129, China
Chunsheng Li
Huawei Technologies Co., Ltd, Shenzhen 518129, China