Towards Understanding Bugs in Distributed Training and Inference Frameworks for Large Language Models

📅 2025-06-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Software defects in large language model (LLM) distributed training/inference frameworks—such as DeepSpeed, Megatron-LM, and Colossal-AI—are increasingly prevalent and challenging to diagnose. Method: We conduct the first large-scale empirical study of 308 fixed defects using a mixed-method approach: defect taxonomy construction, root-cause analysis, repair-pattern mining, and engineering practice validation. Contribution/Results: We identify framework-specific error patterns—including communication anomalies and resource scheduling misconfigurations—and three key debugging challenges: symptom–root-cause decoupling, high reproduction cost, and strong cross-component coupling. We introduce the first defect-feature knowledge base tailored to LLM frameworks. Quantitative analysis reveals that 48% of defects are resolved with ≤10 lines of code. Finally, we propose actionable quality assurance recommendations, establishing an empirical foundation and novel pathways for automated debugging in LLM systems.

📝 Abstract
With the rapid development of large language models (LLMs), distributed training and inference frameworks like DeepSpeed have become essential for scaling model training and inference across multiple GPUs or nodes. However, the increasing complexity of these frameworks brings non-trivial software bugs, which may degrade training performance, cause unexpected failures, and result in significant resource waste. Understanding the characteristics of framework bugs is fundamental for quality assurance, enabling the design of more effective debugging and repair methods. Thus, our paper conducts the first large-scale empirical analysis of 308 fixed bugs across three popular distributed training/inference frameworks: DeepSpeed, Megatron-LM, and Colossal-AI. We examine bug symptoms, root causes, bug identification and fixing efforts, and common low-effort fixing strategies. Additionally, the distributed nature of these frameworks introduces unique bug root causes, such as allocation strategy errors and distributed communication errors. Diagnosing and fixing complex bugs remains challenging due to factors like the disconnect between symptoms and root causes, high bug reproduction costs, and low-level or cross-component interactions. Interestingly, we observe that 48% of bug fixes require minimal code changes (≤10 LOC) and follow simple strategies such as conditional logic optimization, parameter handling enhancement, or version compatibility handling, indicating potential for automation. Based on these insights, we offer several implications for improving the reliability of both distributed training and inference frameworks and their dependent LLM projects, while also identifying opportunities to leverage LLM-based tools for automated debugging and repair.
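The low-effort fix strategies the abstract names (conditional logic optimization, parameter handling enhancement, version compatibility handling) can be illustrated with a minimal, hypothetical sketch. The function names and version thresholds below are illustrative only and are not drawn from any of the studied frameworks:

```python
# Hypothetical sketch of the "version compatibility handling" fix pattern:
# guard an API call behind a dependency-version check instead of assuming
# the newest API is always available. All names here are invented for
# illustration.

def parse_version(v: str) -> tuple:
    """Turn a version string like '2.1.0' into (2, 1, 0) for comparison."""
    return tuple(int(part) for part in v.split(".")[:3])

def select_profiler_path(dep_version: str) -> str:
    # Fix pattern: branch on the dependency's version rather than calling
    # an API that only exists in newer releases (a common ≤10-LOC patch).
    if parse_version(dep_version) >= (2, 0, 0):
        return "new_api"      # e.g. code path for a hypothetical dep >= 2.0
    return "legacy_api"       # fallback for older installs

print(select_profiler_path("2.1.0"))   # new_api
print(select_profiler_path("1.13.1"))  # legacy_api
```

Patches of this shape combine two of the reported strategies at once: a version check (compatibility handling) wrapped in a small conditional (conditional logic optimization), which is consistent with the paper's observation that many fixes are automatable.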
Problem

Research questions and friction points this paper is trying to address.

Analyze bugs in distributed LLM training/inference frameworks
Identify root causes and fixing strategies for framework bugs
Explore automation potential for debugging distributed system issues
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale empirical analysis of framework bugs
Examines symptoms, root causes, and fixing strategies
Identifies potential for automated bug fixes
Xiao Yu
The State Key Laboratory of Blockchain and Data Security, Zhejiang University, Hangzhou, China
Haoxuan Chen
PhD Candidate at ICME, Stanford University
Applied and Computational Mathematics, Statistics, Machine Learning, Scientific Computing
Feifei Niu
University of Ottawa
software engineering, empirical software engineering, requirements engineering
Xing Hu
The State Key Laboratory of Blockchain and Data Security, Zhejiang University, Hangzhou, China
J. Keung
Department of Computer Science, City University of Hong Kong, Hong Kong, China
Xin Xia
The State Key Laboratory of Blockchain and Data Security, Zhejiang University, Hangzhou, China