A Comprehensive Study of Bugs in Modern Distributed Deep Learning Systems

📅 2025-12-23
🤖 AI Summary
Distributed deep learning (DDL) frameworks suffer from a lack of systematic understanding of their defects, which hinders robustness and maintainability. Method: We conduct the first large-scale empirical study of 849 real-world issues across DeepSpeed, Megatron-LM, and Colossal-AI, proposing the first defect taxonomy tailored to specialized DDL frameworks (34 symptom categories, 28 root-cause categories, and 6 repair patterns) and establishing a phase-aware symptom-cause-repair mapping across six execution stages: initialization, communication, computation, memory, fault tolerance, and scheduling. Results: 45.1% of symptoms, and 95% of issues in the communication-setup stage, occur only in distributed contexts; over 60% of defects are resolvable via version/dependency management or tuning of distributed features, APIs, and communication. Setup failures, memory anomalies, and performance deviations emerge as the top three distributed-specific defect types. We quantify root-cause distributions per stage and distill reusable repair patterns and engineering best practices that directly support framework robustness.

📝 Abstract
In today's data-driven era, deep learning is vital for processing massive datasets, yet single-device training is constrained by computational and memory limits. Distributed deep learning overcomes these challenges by leveraging multiple GPUs or machines in parallel. While general-purpose frameworks (e.g., TensorFlow and PyTorch) provide distributed capabilities, these are often add-on features that demand significant manual effort for advanced parallelism, underscoring the need for specialized frameworks. This study conducts the first large-scale empirical analysis of practitioner challenges in dedicated distributed frameworks. We examine 849 real-world issues from DeepSpeed, Megatron-LM, and Colossal-AI and construct a taxonomy of 34 bug symptoms, 28 root causes, and 6 fix patterns. Crucially, we establish explicit mappings between symptoms, causes, and fixes across distributed training stages, enabling a systematic understanding of how issues emerge and are resolved. Our results show that 45.1% of bug symptoms are unique to distributed frameworks, with setup failures, memory issues, and performance anomalies being the most prevalent. Moreover, 95% of issues in the communication setup stage occur exclusively in distributed contexts. We also find over 60% of cases can be resolved through version and dependency management, and distributed feature, API, and communication tuning. Based on these findings, we provide actionable implications.
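The paper's central artifact is a phase-aware mapping from observed symptoms to root causes and repair patterns across six execution stages. As a rough illustration only (this is not the paper's actual dataset or tooling; the dictionary layout, the `candidate_repairs` helper, and the specific cause/repair pairings below are hypothetical, populated with stage, symptom, and repair names mentioned in the summary), such a mapping could be represented as:

```python
# Illustrative sketch of a phase-aware symptom -> (root cause, repair) mapping.
# Stage names and example entries are drawn from the paper's summary;
# the structure and helper function are assumptions for illustration.

STAGES = [
    "initialization", "communication", "computation",
    "memory", "fault tolerance", "scheduling",
]

# Each stage maps observed symptoms to candidate (root cause, repair) pairs.
taxonomy = {
    "communication": {
        "setup failure": [
            ("version/dependency mismatch", "version and dependency management"),
            ("misconfigured communication backend", "communication tuning"),
        ],
    },
    "memory": {
        "memory anomaly": [
            ("distributed feature misconfiguration", "distributed feature tuning"),
        ],
    },
}

def candidate_repairs(stage: str, symptom: str) -> list:
    """Return the repair patterns recorded for a symptom in a given stage."""
    return [repair for _cause, repair in taxonomy.get(stage, {}).get(symptom, [])]

print(candidate_repairs("communication", "setup failure"))
```

A lookup like this mirrors how the study's explicit symptom–cause–fix mappings could support diagnosis: given where in the training lifecycle a failure appears, practitioners can narrow down likely causes and proven fixes.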
Problem

Research questions and friction points this paper is trying to address.

Analyzes bugs in distributed deep learning frameworks
Identifies symptoms, causes, and fixes for training issues
Examines setup, memory, and performance anomalies in systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzes real-world bugs in distributed deep learning frameworks
Creates taxonomy mapping symptoms, causes, and fixes systematically
Identifies version management and communication tuning as key solutions
Xiaoxue Ma (Hong Kong Metropolitan University, China)
Wanwei Zhan (Wuhan University of Technology, China)
Jiale Chen (Wuhan University of Technology, China)
Yishu Li (Hong Kong Metropolitan University, China)
Jacky Keung (City University of Hong Kong, China)
Federica Sarro (Professor, University College London)
Research interests: AI Engineering, SBSE, Automated Software Engineering, Empirical Software Engineering