ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large-scale language model pre-training on massive GPU clusters is highly susceptible to hardware failures, and existing fault-tolerance approaches struggle to balance training-trajectory consistency with efficiency. This work proposes ReCoVer, a parallelism-agnostic fault-tolerance system that keeps per-iteration gradients stochastically equivalent to a failure-free run by holding the number of microbatches per iteration constant. It is built from three decoupled protocol layers (fault-tolerant collective communication, in-step fine-grained recovery, and dynamic redistribution of microbatch quotas across survivors) and integrates with both 3D parallelism and HSDP. In experiments on up to 512 GPUs, ReCoVer preserves the failure-free training trajectory despite losing 256 GPUs over the course of the run, achieves 2.23× higher effective throughput than checkpoint-and-restart baselines after successive failures, and processes 74.9% more tokens at 234 GPU-hours.

📝 Abstract

Pre-training large language models on massive GPU clusters has made hardware faults routine rather than rare, driving the need for resilient training systems. Yet existing frameworks either focus on specific parallelism schemes or risk drifting away from a failure-free training trajectory. We propose ReCoVer, a resilient LLM pre-training system that upholds a single invariant: each iteration keeps the number of microbatches constant, ensuring per-iteration gradients remain stochastically equivalent to a failure-free run. The framework is organized as three decoupled protocol layers: (1) fault-tolerant collectives that keep faults from propagating across replicas; (2) in-step fine-grained recovery that preserves intra-iteration progress and prevents gradient corruption; (3) a versatile-workload policy that dynamically redistributes microbatch quotas across the survivors. The design is parallelism-agnostic, integrating directly with both 3D parallelism and Hybrid Sharded Data Parallel (HSDP) as a drop-in substrate. We evaluate our implementation on end-to-end pre-training tasks on up to 512 GPUs. ReCoVer preserves the training trajectory of a failure-free reference despite losing 256 GPUs over the course of the run. Compared with checkpoint-and-restart baselines, ReCoVer achieves $2.23\times$ higher effective throughput after successive failures. As a result, ReCoVer processes 74.9% more tokens at 234 GPU-hours, and the gap widens as training continues.
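To make the invariant concrete, here is a minimal sketch of how a fixed per-iteration microbatch count might be re-split across surviving data-parallel replicas. It is an illustration only, not ReCoVer's actual policy: the function name `redistribute_quotas`, the even-split rule, and the replica counts are all assumptions made for this example.

```python
from typing import Dict, List


def redistribute_quotas(total_microbatches: int, survivors: List[int]) -> Dict[int, int]:
    """Split a fixed number of microbatches across surviving replicas.

    Hypothetical helper: an even split with the remainder given to the
    lowest-ranked survivors. The real policy may weigh replicas differently.
    """
    if not survivors:
        raise ValueError("no surviving replicas to assign microbatches to")
    base, extra = divmod(total_microbatches, len(survivors))
    ordered = sorted(survivors)
    # The first `extra` survivors each take one additional microbatch,
    # so the quotas always sum to `total_microbatches`.
    return {rank: base + (1 if i < extra else 0) for i, rank in enumerate(ordered)}


# Assumed setup: 8 replicas sharing 32 microbatches per iteration.
before = redistribute_quotas(32, [0, 1, 2, 3, 4, 5, 6, 7])

# Replicas 3 and 6 fail; the same 32 microbatches are spread over 6 survivors.
after = redistribute_quotas(32, [0, 1, 2, 4, 5, 7])

print(before)
print(after)
```

In this sketch the quotas sum to 32 both before and after the failures, so a gradient averaged over all microbatches is normalized by the same count as in a failure-free iteration; only the per-replica share of the work changes.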

Problem

Research questions and friction points this paper is trying to address.

fault tolerance

LLM pre-training

hardware failures

training trajectory consistency

resilient training

Innovation

Methods, ideas, or system contributions that make the work stand out.

fault-tolerant training

LLM pre-training

fine-grained recovery