Understanding Stragglers in Large Model Training Using What-if Analysis

📅 2025-05-09
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Stragglers severely degrade the performance of distributed large language model (LLM) training. Method: drawing on five months of real-world training cluster data from ByteDance, the authors propose a multidimensional analytical framework that integrates distributed monitoring and tracing, GPU-level performance profiling, temporal pattern mining, and *what-if* causal inference. Contribution/Results: the study systematically attributes straggler root causes, showing that they are multi-source rather than stemming from hardware failures alone, and quantifies an average 37% waste in training time. Network congestion, GPU memory thrashing, and scheduler-induced resource contention emerge as the top three contributors, each exhibiting predictable spatiotemporal patterns. Leveraging these insights, the authors design a deployable, real-time straggler early-warning module. The work advances observability and robustness optimization for distributed AI training systems through both theoretical grounding and practical implementation.

📝 Abstract
Large language model (LLM) training is one of the most demanding distributed computations today, often requiring thousands of GPUs with frequent synchronization across machines. Such a workload pattern makes it susceptible to stragglers, where training can be stalled by a few slow workers. At ByteDance we find that stragglers are not always caused simply by hardware failures, but can arise from multiple complex factors. This work presents a comprehensive study of straggler issues in LLM training, using a five-month trace collected from our ByteDance LLM training cluster. The core methodology is what-if analysis, which simulates a scenario without any stragglers and contrasts it with the actual case. We use this method to study the following questions: (1) how often do stragglers affect training jobs, and what effect do they have on job performance; (2) do stragglers exhibit temporal or spatial patterns; and (3) what are the potential root causes of stragglers?
Problem

Research questions and friction points this paper is trying to address.

Studying stragglers' impact on large language model training performance
Identifying temporal and spatial patterns of straggler occurrences
Investigating root causes of stragglers in distributed GPU clusters
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses what-if analysis for straggler investigation
Analyzes a five-month trace from an LLM training cluster
Simulates straggler-free scenarios for comparison
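The what-if idea above can be sketched in a few lines. This is a simplified illustration under assumed inputs (synthetic per-worker step times and a median-based counterfactual), not the paper's actual simulator: in synchronous training each step is gated by the slowest worker, so a straggler-free counterfactual can be approximated by replacing every worker's time with a robust per-step baseline and comparing totals.

```python
import statistics

# Hypothetical per-worker durations (seconds) for three synchronous steps.
step_times = [
    [1.0, 1.1, 1.0, 3.2],   # step 1: worker 4 straggles
    [1.0, 1.0, 1.0, 1.1],   # step 2: no notable straggler
    [2.8, 1.0, 1.1, 1.0],   # step 3: worker 1 straggles
]

def total_time(steps):
    # Synchronous training: each step costs as much as its slowest worker.
    return sum(max(workers) for workers in steps)

def what_if_no_stragglers(steps):
    # Counterfactual run: replace each step's cost with the median worker
    # time, approximating how long the step would take without stragglers.
    return sum(statistics.median(workers) for workers in steps)

actual = total_time(step_times)              # 3.2 + 1.1 + 2.8 = 7.1
ideal = what_if_no_stragglers(step_times)    # 1.05 + 1.0 + 1.05 = 3.1
waste = 1 - ideal / actual                   # fraction of time lost to stragglers
```

In this toy trace, roughly 56% of the wall-clock time is attributable to stragglers; the paper's reported average waste (37%) comes from applying this style of contrast to real cluster traces.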
👥 Authors
Jinkun Lin, New York University
Ziheng Jiang, Research Scientist, ByteDance (Systems, Machine Learning)
Zuquan Song, ByteDance
Sida Zhao, ByteDance Seed
Menghan Yu, ByteDance (Machine Learning)
Zhanghan Wang, New York University
Chenyuan Wang, ByteDance Seed
Zuocheng Shi, Zhejiang University
Xiang Shi, ByteDance
Wei Jia, ByteDance Seed
Zherui Liu, ByteDance Seed
Shuguang Wang, ByteDance Seed
Haibin Lin, ByteDance (Machine Learning Systems, Natural Language Processing)
Xiu Liu, ByteDance Seed
Aurojit Panda, NYU (Distributed Systems, Networking, Cluster Computing)
Jinyang Li, New York University