🤖 AI Summary
Defects in distributed deep learning (DDL) frameworks lack a systematic understanding, which hinders efforts to improve framework robustness and maintainability.
Method: We conduct the first large-scale empirical study of 849 real-world issues across DeepSpeed, Megatron-LM, and Colossal-AI. We propose the first defect taxonomy tailored to specialized DDL frameworks, comprising 34 symptom categories, 28 root-cause categories, and 6 repair patterns, and establish a phase-aware symptom-cause-repair mapping across six execution stages: initialization, communication, computation, memory, fault tolerance, and scheduling.
Results: We find that 45.1% of symptoms are unique to distributed settings, and that 95% of issues in the communication setup stage occur exclusively in distributed contexts; over 60% of defects are resolvable via version/dependency management or distributed tuning. Setup failures, memory anomalies, and performance deviations emerge as the top three distributed-specific defect types. We quantify root-cause distributions per stage and distill reusable repair patterns and engineering best practices that directly support enhanced framework robustness.
📝 Abstract
In today's data-driven era, deep learning is vital for processing massive datasets, yet single-device training is constrained by computational and memory limits. Distributed deep learning overcomes these challenges by leveraging multiple GPUs or machines in parallel. While general-purpose frameworks (e.g., TensorFlow and PyTorch) provide distributed capabilities, these are often add-on features that demand significant manual effort for advanced parallelism, underscoring the need for specialized frameworks. This study conducts the first large-scale empirical analysis of practitioner challenges in dedicated distributed frameworks. We examine 849 real-world issues from DeepSpeed, Megatron-LM, and Colossal-AI and construct a taxonomy of 34 bug symptoms, 28 root causes, and 6 fix patterns. Crucially, we establish explicit mappings between symptoms, causes, and fixes across distributed training stages, enabling a systematic understanding of how issues emerge and are resolved. Our results show that 45.1% of bug symptoms are unique to distributed frameworks, with setup failures, memory issues, and performance anomalies being the most prevalent. Moreover, 95% of issues in the communication setup stage occur exclusively in distributed contexts. We also find that over 60% of cases can be resolved through version and dependency management, and through tuning of distributed features, APIs, and communication. Based on these findings, we provide actionable implications for practitioners and framework developers.
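Many of the setup failures the study highlights arise before training even starts, when the process-group rendezvous is misconfigured. As an illustrative sketch (not taken from the paper), a minimal pre-flight check could verify the environment variables that PyTorch-based launchers such as `torchrun` and the DeepSpeed launcher rely on for `torch.distributed.init_process_group`; the names `REQUIRED_VARS` and `check_dist_env` here are hypothetical helpers, not framework APIs:

```python
import os

# Rendezvous variables expected by PyTorch-based distributed launchers
# (torchrun, the DeepSpeed launcher) before the process group can form.
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")

def check_dist_env(env=None):
    """Return the rendezvous variables missing from `env` (defaults to os.environ)."""
    if env is None:
        env = os.environ
    return [v for v in REQUIRED_VARS if v not in env]

# Example: a partially configured environment is missing RANK and WORLD_SIZE.
missing = check_dist_env({"MASTER_ADDR": "127.0.0.1", "MASTER_PORT": "29500"})
print(missing)  # ['RANK', 'WORLD_SIZE']
```

Failing fast with an explicit list of missing variables turns an opaque hang or connection timeout during initialization into an actionable error message, which is the spirit of the configuration-oriented repair patterns the study identifies.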