🤖 AI Summary
Defects in distributed deep learning (DDL) frameworks lack a systematic understanding, which hinders efforts to improve framework robustness and maintainability.
Method: We conduct the first large-scale empirical study of 849 real-world issues across DeepSpeed, Megatron-LM, and Colossal-AI. We propose the first defect taxonomy tailored to specialized DDL frameworks, comprising 34 symptom categories, 28 root-cause categories, and 6 repair patterns, and establish a phase-aware symptom-cause-repair mapping across six execution stages: initialization, communication, computation, memory, fault tolerance, and scheduling.
Results: We find that 45.1% of symptoms are unique to distributed settings, and that 95% of issues in the communication setup stage occur exclusively in distributed contexts; over 60% of defects are resolvable via version/dependency management or distributed tuning. Setup failures, memory anomalies, and performance deviations emerge as the top three distributed-specific defect types. We quantify root-cause distributions per stage and distill reusable repair patterns and engineering best practices that directly support enhanced framework robustness.
📝 Abstract
In today's data-driven era, deep learning is vital for processing massive datasets, yet single-device training is constrained by computational and memory limits. Distributed deep learning overcomes these challenges by leveraging multiple GPUs or machines in parallel. While general-purpose frameworks (e.g., TensorFlow and PyTorch) provide distributed capabilities, these are often add-on features that demand significant manual effort for advanced parallelism, underscoring the need for specialized frameworks. This study conducts the first large-scale empirical analysis of practitioner challenges in dedicated distributed frameworks. We examine 849 real-world issues from DeepSpeed, Megatron-LM, and Colossal-AI and construct a taxonomy of 34 bug symptoms, 28 root causes, and 6 fix patterns. Crucially, we establish explicit mappings between symptoms, causes, and fixes across distributed training stages, enabling a systematic understanding of how issues emerge and are resolved. Our results show that 45.1% of bug symptoms are unique to distributed frameworks, with setup failures, memory issues, and performance anomalies being the most prevalent. Moreover, 95% of issues in the communication setup stage occur exclusively in distributed contexts. We also find that over 60% of cases can be resolved through version and dependency management, and through tuning of distributed features, APIs, and communication. Based on these findings, we provide actionable implications for practitioners and framework developers.
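Many of the setup failures the study highlights arise before training even starts, when the process-group rendezvous is misconfigured. As an illustrative sketch (not taken from the paper), a minimal pre-flight check could verify the environment variables that PyTorch-based launchers such as `torchrun` and the DeepSpeed launcher rely on for `torch.distributed.init_process_group`; the names `REQUIRED_VARS` and `check_dist_env` here are hypothetical helpers, not framework APIs:

```python
import os

# Rendezvous variables expected by PyTorch-based distributed launchers
# (torchrun, the DeepSpeed launcher) before the process group can form.
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")

def check_dist_env(env=None):
    """Return the rendezvous variables missing from `env` (defaults to os.environ)."""
    if env is None:
        env = os.environ
    return [v for v in REQUIRED_VARS if v not in env]

# Example: a partially configured environment is missing RANK and WORLD_SIZE.
missing = check_dist_env({"MASTER_ADDR": "127.0.0.1", "MASTER_PORT": "29500"})
print(missing)  # ['RANK', 'WORLD_SIZE']
```

Failing fast with an explicit list of missing variables turns an opaque hang or connection timeout during initialization into an actionable error message, which is the spirit of the configuration-oriented repair patterns the study identifies.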