FlashRecovery: Fast and Low-Cost Recovery from Failures for Large-Scale Training of LLMs

📅 2025-09-03
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address frequent hardware/software failures during large-scale training of large language models (LLMs), which cause costly interruptions and expensive recovery overhead, this paper proposes a highly scalable fault-tolerant training system. Methodologically, it abandons conventional periodic checkpointing in favor of proactive real-time failure detection, scale-invariant task restart, and checkpoint-free single-step recovery. The system integrates continuous state monitoring, a lightweight communication group reconstruction protocol, and differentiated recovery strategies for normal and faulty nodes to ensure rapid synchronization and responsiveness across thousands of devices. Evaluated on a cluster of 4,800 devices, the system achieves recovery times of about 150 seconds, with no significant degradation as scale increases. This design substantially improves training continuity and resource utilization in massive distributed LLM training.
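FlashRecovery's detection code is not included in this entry; the sketch below only illustrates, under stated assumptions, how heartbeat-style continuous training-state monitoring can flag a failed rank within seconds. The `FailureDetector` class, the timeout and scan intervals, and the `on_failure` callback are hypothetical names chosen for illustration, not the paper's implementation.

```python
# Hypothetical sketch of heartbeat-based failure detection (not the paper's code).
# Each training worker posts a timestamp after every step; a monitor thread flags
# any rank whose last heartbeat is older than a timeout on the order of seconds.
import threading
import time

HEARTBEAT_TIMEOUT_S = 5.0   # illustrative "within seconds" detection threshold
CHECK_INTERVAL_S = 1.0


class FailureDetector:
    def __init__(self, world_size: int):
        self.last_seen = {rank: time.monotonic() for rank in range(world_size)}
        self.lock = threading.Lock()
        self.failed = set()

    def heartbeat(self, rank: int) -> None:
        """Called by (or on behalf of) a worker after each training step."""
        with self.lock:
            self.last_seen[rank] = time.monotonic()

    def scan(self) -> set:
        """Return the set of ranks whose heartbeat has gone stale."""
        now = time.monotonic()
        with self.lock:
            for rank, ts in self.last_seen.items():
                if now - ts > HEARTBEAT_TIMEOUT_S:
                    self.failed.add(rank)
            return set(self.failed)


def monitor_loop(detector: FailureDetector, on_failure) -> None:
    """Poll the detector and hand any failed ranks to the recovery path."""
    while True:
        failed = detector.scan()
        if failed:
            on_failure(failed)
            return
        time.sleep(CHECK_INTERVAL_S)
```

A production system would additionally watch collective-communication errors and process exit codes, but the stale-heartbeat scan above captures the seconds-level detection idea.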

๐Ÿ“ Abstract
Large language models (LLMs) have made a profound impact across various fields due to their advanced capabilities. However, training these models at unprecedented scales requires extensive AI accelerator clusters and sophisticated parallelism strategies, which pose significant challenges in maintaining system reliability over prolonged training periods. A major concern is the substantial loss of training time caused by inevitable hardware and software failures. To address these challenges, we present FlashRecovery, a fast and low-cost failure recovery system comprising three core modules: (1) Active and real-time failure detection. This module performs continuous training state monitoring, enabling immediate identification of hardware and software failures within seconds, thus ensuring rapid incident response; (2) Scale-independent task restart. By employing different recovery strategies for normal and faulty nodes, combined with an optimized communication group reconstruction protocol, our approach ensures that the recovery time remains nearly constant, regardless of cluster scale; (3) Checkpoint-free recovery within one step. Our novel recovery mechanism enables single-step restoration, completely eliminating dependence on traditional checkpointing methods and their associated overhead. Collectively, these innovations enable FlashRecovery to achieve optimal Recovery Time Objective (RTO) and Recovery Point Objective (RPO), substantially improving the reliability and efficiency of long-duration LLM training. Experimental results demonstrate that FlashRecovery can restore training on a cluster with 4,800 devices within 150 seconds. We also verify that the time required for failure recovery remains nearly constant across different scales of training tasks.
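The abstract describes the recovery mechanism only at a high level; the following is a minimal sketch, assuming a PyTorch-style data-parallel setup, of the general idea behind combining communication-group reconstruction with checkpoint-free state transfer: after a restart, the replacement rank receives the latest parameters and optimizer tensors from a surviving replica instead of reloading a checkpoint. The function name `rebuild_and_recover`, the `src_rank` choice, and the use of `torch.distributed` are assumptions for illustration and are not taken from the paper.

```python
# Hedged illustration (not FlashRecovery's actual code): after a restart, every
# rank re-joins the process group, then the latest parameters/optimizer tensors
# are broadcast from a surviving data-parallel peer, so no checkpoint is read.
import torch
import torch.distributed as dist


def rebuild_and_recover(model, optimizer, src_rank: int) -> None:
    """Reconstruct communication groups and sync live training state.

    src_rank is assumed to be a healthy rank holding up-to-date state for this
    rank's data-parallel replica; in a real tensor/pipeline-parallel job the
    broadcast would be scoped to that sub-group rather than the world group.
    """
    # 1) Communication group reconstruction: re-initialize the default process
    #    group with the new world configuration (env:// rendezvous).
    if not dist.is_initialized():
        dist.init_process_group(backend="nccl", init_method="env://")

    # 2) Checkpoint-free state transfer: src_rank sends, all other ranks receive.
    for param in model.parameters():
        dist.broadcast(param.data, src=src_rank)
    for state in optimizer.state.values():
        for value in state.values():
            if torch.is_tensor(value):
                dist.broadcast(value, src=src_rank)

    # 3) All ranks resume from the same iteration; at most one step is redone.
```

Because the state transfer involves only a rank's own replica group, its cost is governed by model size rather than cluster size, which is one plausible reason recovery time can stay nearly constant as the job scales.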
Problem

Research questions and friction points this paper is trying to address.

Fast failure recovery for large-scale LLM training
Minimizing training time loss from hardware failures
Eliminating checkpoint overhead during system restoration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Active real-time failure detection within seconds
Scale-independent task restart with constant recovery time
Checkpoint-free single-step recovery eliminating traditional checkpointing overhead (a back-of-envelope cost comparison follows this list)
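To make the last point concrete, here is a standard back-of-envelope comparison (not taken from the paper) of work lost per failure under periodic checkpointing versus single-step recovery; the checkpoint interval, checkpoint-based restart time, and step time used in the example are assumed numbers, while the 150 s restart matches the figure reported in the abstract.

```python
# Back-of-envelope lost-time model per failure (illustrative assumptions only).

def lost_time_checkpointing(ckpt_interval_s: float, restart_s: float) -> float:
    # On average a failure lands mid-interval, so roughly half the interval of
    # computed work must be redone, plus the restart itself.
    return ckpt_interval_s / 2 + restart_s


def lost_time_single_step(step_time_s: float, restart_s: float) -> float:
    # Checkpoint-free recovery redoes at most one training step.
    return step_time_s + restart_s


# Assumed numbers: 30-min checkpoint interval, 10-min checkpoint-based restart,
# versus a 150 s FlashRecovery-style restart and a 30 s training step.
print(lost_time_checkpointing(30 * 60, 10 * 60))  # -> 1500.0 s lost per failure
print(lost_time_single_step(30, 150))             # -> 180.0 s lost per failure
```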
Authors

Haijun Zhang
Professor, IEEE Fellow, University of Science and Technology Beijing
6G, AI-enabled Wireless Communications, Resource Allocation, Mobility Management
Jinxiang Wang
iFLYTEK AI Engineering Institute, Hefei 230088, China
Zhenhua Yu
iFLYTEK AI Engineering Institute, Hefei 230088, China
Yanyong Zhang
University of Science and Technology of China; Rutgers University (Adjunct Visiting Professor)
Sensing, Cyber-Physical Systems, Multi-Modal Perception, Efficient AI Systems
Xuejie Ji
iFLYTEK AI Engineering Institute, Hefei 230088, China
Kaining Mao
iFLYTEK AI Engineering Institute, Hefei 230088, China
Jun Zhang
iFLYTEK AI Engineering Institute, Hefei 230088, China
Yaqing Zhang
iFLYTEK AI Engineering Institute, Hefei 230088, China
Ting Wu
iFLYTEK AI Engineering Institute, Hefei 230088, China
Fei Jie
iFLYTEK AI Engineering Institute, Hefei 230088, China
Xiemin Huang
iFLYTEK AI Engineering Institute, Hefei 230088, China
Zhifang Cai
Huawei Technologies Co., Ltd, Shenzhen 518129, China
Junhua Cheng
Huawei Technologies Co., Ltd, Shenzhen 518129, China
Shuwei Wang
Huawei Technologies Co., Ltd, Shenzhen 518129, China
Wei Li
Huawei Technologies Co., Ltd, Shenzhen 518129, China
Xiaoming Bao
Huawei Technologies Co., Ltd, Shenzhen 518129, China
Hua Xu
Huawei Technologies Co., Ltd, Shenzhen 518129, China
Shixiong Zhao
University of Hong Kong
Distributed Systems
Jun Li
Huawei Technologies Co., Ltd, Shenzhen 518129, China
Hongwei Sun
Huawei Technologies Co., Ltd, Shenzhen 518129, China
Ziyang Zhang
Huawei Technologies Co., Ltd, Shenzhen 518129, China
Yi Xiong
Huawei Technologies Co., Ltd, Shenzhen 518129, China
Chunsheng Li
Huawei Technologies Co., Ltd, Shenzhen 518129, China