Robust LLM Training Infrastructure at ByteDance

📅 2025-09-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large-scale LLM training on GPU clusters with tens of thousands of devices suffers from frequent failures, including CUDA errors, NaN values, and job hangs, which severely undermine training stability and efficiency. To address this, the paper proposes ByteRobust, a GPU infrastructure management system tailored for robust LLM training. ByteRobust treats failure detection and recovery as a routine part of training, combining distributed real-time monitoring, data-driven fault demarcation and localization, and checkpoint-based recovery to keep jobs running with minimal interruption. Deployed on a production platform with over 200,000 GPUs, ByteRobust achieves a 97% effective training time ratio (ETTR) for a three-month training job on 9,600 GPUs.

📝 Abstract
The training scale of large language models (LLMs) has reached tens of thousands of GPUs and is still continuously expanding, enabling faster training of larger models. Accompanying this expansion of resource scale is the prevalence of failures (CUDA errors, NaN values, job hangs, etc.), which poses significant challenges to training stability. Any large-scale LLM training infrastructure should strive for minimal training interruption, efficient fault diagnosis, and effective failure tolerance to enable highly efficient continuous training. This paper presents ByteRobust, a large-scale GPU infrastructure management system tailored for robust and stable training of LLMs. It exploits the uniqueness of the LLM training process and gives top priority to detecting and recovering from failures in a routine manner. Leveraging the parallelisms and characteristics of LLM training, ByteRobust enables high-capacity fault tolerance and prompt fault demarcation and localization with an effective data-driven approach, comprehensively ensuring continuous and efficient training of LLM tasks. ByteRobust is deployed on a production GPU platform with over 200,000 GPUs and achieves 97% ETTR for a three-month training job on 9,600 GPUs.
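The headline 97% figure is easier to interpret with the metric written out. Assuming ETTR denotes the effective training time ratio (productive training time divided by total wall-clock time, a common reliability metric for long-running jobs), a minimal sketch of the arithmetic looks like this; the downtime breakdown is hypothetical, not taken from the paper:

```python
# Illustrative computation of the Effective Training Time Ratio (ETTR):
# the fraction of wall-clock time spent on productive training, as opposed
# to failure detection, diagnosis, rollback, and recomputed work.

def ettr(productive_seconds, total_seconds):
    """Effective training time ratio: productive time / total wall-clock time."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return productive_seconds / total_seconds

# A three-month (~90-day) job; the 3% downtime share here is a made-up
# example chosen to reproduce the paper's reported 97% ETTR.
total_hours = 90 * 24
downtime_hours = total_hours * 0.03
print(f"ETTR: {ettr(total_hours - downtime_hours, total_hours):.2f}")
```

At this scale even small per-failure costs compound: 3% of a 90-day run is roughly 65 GPU-cluster hours lost, which is why the paper emphasizes fast demarcation and routine recovery.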
Problem

Research questions and friction points this paper is trying to address.

Addressing frequent failures in large-scale LLM training on tens of thousands of GPUs
Ensuring minimal training interruption and efficient fault diagnosis for stability
Providing effective failure tolerance to enable continuous, efficient model training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Routine failure detection and recovery system
Data-driven fault demarcation and localization
High-capacity fault tolerance leveraging LLM parallelism
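The routine detection-and-recovery idea in the bullets above can be sketched, at a very high level, as a checkpoint-restart loop: run steps, checkpoint periodically, and on failure roll back to the last checkpoint and continue. Function names and the checkpoint interval below are illustrative placeholders, not ByteRobust's actual API or policy:

```python
# Minimal sketch of routine fault tolerance via checkpoint-restart.
# Real systems additionally diagnose and isolate faulty machines before
# restarting; this sketch only shows the rollback-and-continue skeleton.

CKPT_INTERVAL = 100  # steps between checkpoints (hypothetical value)

def train_with_recovery(run_step, load_checkpoint, save_checkpoint, total_steps):
    """Drive training to total_steps, surviving transient step failures."""
    step = load_checkpoint()            # resume from the last persisted step
    while step < total_steps:
        try:
            run_step(step)              # one training step; may raise
        except RuntimeError:            # e.g. a surfaced CUDA error
            step = load_checkpoint()    # roll back: work done since the last
            continue                    # checkpoint is recomputed
        step += 1
        if step % CKPT_INTERVAL == 0:
            save_checkpoint(step)
    return step
```

The checkpoint interval sets the trade-off the paper's recovery design has to manage: shorter intervals cost more I/O during normal training, while longer intervals mean more recomputation after each failure.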
👥 Authors
Borui Wan (The University of Hong Kong), Gaohong Liu, Zuquan Song (ByteDance), Jun Wang, Yun Zhang, Guangming Sheng (The University of Hong Kong), Shuguang Wang, Houmin Wei, Chenyuan Wang, Weiqiang Lou, Xi Yang, Mofan Zhang, Kaihua Jiang, Cheng Ren, Xiaoyun Zhi, Menghan Yu (ByteDance), Zhe Nan, Zhuolin Zheng, Baoquan Zhong, Qinlong Wang, Huan Yu, Jinxin Chi, Wang Zhang (Tianjin University), Yuhan Li, Zixian Du