Training LLMs with Fault Tolerant HSDP on 100,000 GPUs

📅 2026-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the inefficiency of conventional synchronous training in large-scale GPU clusters—specifically those with over 100,000 GPUs—where frequent hardware failures and prolonged recovery procedures severely degrade performance. The authors propose a fault-tolerant hybrid sharded data parallelism (FT-HSDP) paradigm that, for the first time, treats data-parallel replicas as independent fault-tolerance units. By integrating dynamic participant management and a non-blocking catch-up mechanism, FT-HSDP enables localized fault recovery: only affected replicas are restarted while others continue training uninterrupted. The system employs a CPU-coordinated, GPU-executed fault-tolerant all-reduce protocol (FTAR) that supports efficient asynchronous recovery without compromising model accuracy. Experiments on a 100,000-GPU cluster demonstrate that this approach reduces fault-induced training stalls from 10 minutes to 3 minutes and increases effective training time from 44% to 80%.
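The core idea of treating each data-parallel replica as an independent fault-tolerance unit can be illustrated with a toy simulation. This is a hedged sketch, not the paper's implementation: the `Replica`, `train_step`, and `catch_up` names are invented, and a whole model is collapsed into a single float.

```python
# Illustrative sketch (not the paper's code): data-parallel replicas as
# units of fault tolerance. When one replica fails, the others keep
# training; the failed one later copies fresh weights and rejoins.

class Replica:
    """One data-parallel replica; holds a full model copy (here, one float)."""
    def __init__(self, rid, weight=0.0):
        self.rid = rid
        self.weight = weight
        self.alive = True

def train_step(replicas, grad, lr=0.1):
    """Healthy replicas apply an averaged gradient; failed ones are skipped."""
    live = [r for r in replicas if r.alive]
    for r in live:
        r.weight -= lr * grad          # plain SGD update on survivors only
    return live

def catch_up(failed, healthy):
    """Recovering replica copies the latest weights from a healthy peer
    and rejoins; the other replicas never stopped training."""
    failed.weight = healthy.weight
    failed.alive = True

replicas = [Replica(i) for i in range(4)]
train_step(replicas, grad=1.0)         # all 4 replicas step
replicas[2].alive = False              # simulate a GPU failure in replica 2
train_step(replicas, grad=1.0)         # only 3 replicas step; no global stall
catch_up(replicas[2], replicas[0])     # localized, asynchronous recovery
train_step(replicas, grad=1.0)         # back to 4 replicas, all in sync
```

The point of the sketch is the contrast with fully synchronous training, where the failure at step 2 would have stalled all four replicas until recovery completed.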

📝 Abstract
Large-scale training systems typically use synchronous training, requiring all GPUs to be healthy simultaneously. In our experience training on O(100K) GPUs, synchronous training results in low efficiency due to frequent failures and long recovery times. To address this problem, we propose a novel training paradigm, Fault Tolerant Hybrid Sharded Data Parallelism (FT-HSDP). FT-HSDP uses data-parallel replicas as units of fault tolerance. When failures occur, only the single data-parallel replica containing the failed GPU or server is taken offline and restarted, while the other replicas continue training. To realize this idea at scale, FT-HSDP incorporates several techniques: 1) We introduce a Fault Tolerant All-Reduce (FTAR) protocol for gradient exchange across data-parallel replicas. FTAR relies on the CPU to drive the complex control logic for tasks such as dynamically adding or removing participants, and relies on the GPU to perform data transfer for best performance. 2) We introduce a non-blocking catch-up protocol, allowing a recovering replica to rejoin training with minimal stall. Compared with fully synchronous training at O(100K) GPUs, FT-HSDP reduces the stall time due to failure recovery from 10 minutes to 3 minutes, increasing effective training time from 44% to 80%. We further demonstrate that FT-HSDP's asynchronous recovery does not cause any meaningful degradation in the accuracy of the resulting model.
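The CPU/GPU split in FTAR can be sketched as follows. This is a minimal sketch under assumed names (`Coordinator`, `all_reduce` are illustrative, not the paper's API): a CPU-side control plane owns the membership list and can add or remove participants, while the data-plane "all-reduce" simply averages gradients over whoever is currently in the group.

```python
# Hedged sketch of an FTAR-style gradient exchange: the CPU drives
# membership changes (control plane), while the averaging step stands in
# for the GPU-executed data transfer (data plane).

class Coordinator:
    """CPU-driven control logic: tracks which replicas may exchange gradients."""
    def __init__(self, ids):
        self.members = set(ids)

    def remove(self, rid):
        """Called when a replica's GPU or server fails."""
        self.members.discard(rid)

    def add(self, rid):
        """Called after a recovering replica finishes catch-up."""
        self.members.add(rid)

def all_reduce(grads, coord):
    """Average gradients across current members only; a removed replica
    simply contributes nothing this round, so the others never block on it."""
    live = [g for rid, g in grads.items() if rid in coord.members]
    avg = sum(live) / len(live)
    return {rid: avg for rid in coord.members}

coord = Coordinator([0, 1, 2, 3])
out = all_reduce({0: 1.0, 1: 2.0, 2: 3.0, 3: 6.0}, coord)   # average over 4
coord.remove(2)                                             # replica 2 fails
out2 = all_reduce({0: 1.0, 1: 2.0, 3: 6.0}, coord)          # average over 3
```

The design point being illustrated: because membership lives on the CPU, the expensive GPU collective never has to run its own failure-detection logic; it just operates on the current participant set.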
Problem

Research questions and friction points this paper is trying to address.

large-scale training
synchronous training
GPU failures
training efficiency
fault tolerance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fault Tolerant HSDP
Large-scale LLM training
FTAR protocol
Non-blocking catch-up
Data parallelism
👥 Authors
Omkar Salpekar, Meta Platforms
Rohan Varma, Graduate Student, Carnegie Mellon University (Signal Processing, Harmonic Analysis, Machine Learning, Statistics)
Kenny Yu, Thinking Machines Labs
Vladimir Ivanov, Innopolis University (software engineering, natural language processing, knowledge engineering, information integration)
Yang Wang, Meta Platforms
Ahmed Sharif, Meta Platforms
Min Si, Meta (High performance computing for AI, distributed communication systems)
Shawn Xu, Google LLC (Machine Learning, Computer Vision, Artificial Intelligence)
Feng Tian, Meta Platforms
Shengbao Zheng, Meta Platforms
Tristan Rice, Meta Platforms
Ankush Garg, Meta Platforms
Shangfu Peng, Meta Platforms
Shreyas Siravara, Meta Platforms
Wenyin Fu, Meta Platforms
Rodrigo de Castro, Meta Platforms
Adithya Gangidi, Meta Platforms
Andrey Obraztsov, Meta Platforms
Sharan Narang, Director, AI Research, Meta (Artificial Intelligence, Deep Learning)
Sergey Edunov, Facebook AI Research (LLM, Machine Translation, Natural Language Processing)
Maxim Naumov, Meta, Director of Engineering & Research (Parallel Algorithms, Numerical Linear Algebra, Numerical Optimization, Graphs, Deep Learning)
Chunqiang Tang, Meta Platforms
Mathew Oldham, Meta Platforms