Robust LLM Training Infrastructure at ByteDance

📅 2025-09-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large-scale LLM training on GPU clusters with tens of thousands of devices suffers from frequent failures, including CUDA errors, NaN values, and job hangs, which severely undermine training stability and efficiency. To address this, the paper proposes ByteRobust, a GPU infrastructure management system tailored for robust LLM training. ByteRobust treats failure detection and recovery as a routine part of training, combining distributed real-time monitoring, data-driven fault demarcation and localization, and checkpoint-based recovery to keep jobs running with minimal interruption. Deployed on a production platform with over 200,000 GPUs, ByteRobust achieves a 97% effective training time ratio (ETTR) for a three-month training job on 9,600 GPUs.

📝 Abstract
The training scale of large language models (LLMs) has reached tens of thousands of GPUs and is still continuously expanding, enabling faster training of larger models. Accompanying this expansion of resource scale is the prevalence of failures (CUDA errors, NaN values, job hangs, etc.), which poses significant challenges to training stability. Any large-scale LLM training infrastructure should strive for minimal training interruption, efficient fault diagnosis, and effective failure tolerance to enable highly efficient continuous training. This paper presents ByteRobust, a large-scale GPU infrastructure management system tailored for robust and stable training of LLMs. It exploits the uniqueness of the LLM training process and gives top priority to detecting and recovering from failures in a routine manner. Leveraging the parallelisms and characteristics of LLM training, ByteRobust enables high-capacity fault tolerance and prompt fault demarcation and localization with an effective data-driven approach, comprehensively ensuring continuous and efficient training of LLM tasks. ByteRobust is deployed on a production GPU platform with over 200,000 GPUs and achieves 97% ETTR for a three-month training job on 9,600 GPUs.
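The headline 97% figure is easier to interpret with the metric written out. Assuming ETTR denotes the effective training time ratio (productive training time divided by total wall-clock time, a common reliability metric for long-running jobs), a minimal sketch of the arithmetic looks like this; the downtime breakdown is hypothetical, not taken from the paper:

```python
# Illustrative computation of the Effective Training Time Ratio (ETTR):
# the fraction of wall-clock time spent on productive training, as opposed
# to failure detection, diagnosis, rollback, and recomputed work.

def ettr(productive_seconds, total_seconds):
    """Effective training time ratio: productive time / total wall-clock time."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return productive_seconds / total_seconds

# A three-month (~90-day) job; the 3% downtime share here is a made-up
# example chosen to reproduce the paper's reported 97% ETTR.
total_hours = 90 * 24
downtime_hours = total_hours * 0.03
print(f"ETTR: {ettr(total_hours - downtime_hours, total_hours):.2f}")
```

At this scale even small per-failure costs compound: 3% of a 90-day run is roughly 65 GPU-cluster hours lost, which is why the paper emphasizes fast demarcation and routine recovery.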
Problem

Research questions and friction points this paper is trying to address.

Addressing frequent failures in large-scale LLM training on tens of thousands of GPUs
Ensuring minimal training interruption and efficient fault diagnosis for stability
Providing effective failure tolerance to enable continuous, efficient model training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Routine failure detection and recovery system
Data-driven fault demarcation and localization
High-capacity fault tolerance leveraging LLM parallelism
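The routine detection-and-recovery idea in the bullets above can be sketched, at a very high level, as a checkpoint-restart loop: run steps, checkpoint periodically, and on failure roll back to the last checkpoint and continue. Function names and the checkpoint interval below are illustrative placeholders, not ByteRobust's actual API or policy:

```python
# Minimal sketch of routine fault tolerance via checkpoint-restart.
# Real systems additionally diagnose and isolate faulty machines before
# restarting; this sketch only shows the rollback-and-continue skeleton.

CKPT_INTERVAL = 100  # steps between checkpoints (hypothetical value)

def train_with_recovery(run_step, load_checkpoint, save_checkpoint, total_steps):
    """Drive training to total_steps, surviving transient step failures."""
    step = load_checkpoint()            # resume from the last persisted step
    while step < total_steps:
        try:
            run_step(step)              # one training step; may raise
        except RuntimeError:            # e.g. a surfaced CUDA error
            step = load_checkpoint()    # roll back: work done since the last
            continue                    # checkpoint is recomputed
        step += 1
        if step % CKPT_INTERVAL == 0:
            save_checkpoint(step)
    return step
```

The checkpoint interval sets the trade-off the paper's recovery design has to manage: shorter intervals cost more I/O during normal training, while longer intervals mean more recomputation after each failure.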
👥 Authors
Borui Wan (The University of Hong Kong), Gaohong Liu, Zuquan Song (ByteDance), Jun Wang, Yun Zhang, Guangming Sheng (The University of Hong Kong), Shuguang Wang, Houmin Wei, Chenyuan Wang, Weiqiang Lou, Xi Yang, Mofan Zhang, Kaihua Jiang, Cheng Ren, Xiaoyun Zhi, Menghan Yu (ByteDance), Zhe Nan, Zhuolin Zheng, Baoquan Zhong, Qinlong Wang, Huan Yu, Jinxin Chi, Wang Zhang (Tianjin University), Yuhan Li, Zixian Du