CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training

📅 2026-05-06
📈 Citations: 0
Influential: 0

📝 Abstract
As training scales grow, collective communication libraries (CCL) increasingly face anomalies arising from complex interactions among hardware, software, and environmental factors. These anomalies typically manifest as slow/hang communication, the most frequent and time-consuming category to diagnose. However, traditional diagnostic methods remain inaccurate and inefficient, frequently requiring hours or even days for root cause analysis. To address this, we propose CCL-D, a high-precision diagnostic system designed to detect and locate slow/hang anomalies in large-scale distributed training. CCL-D integrates a rank-level real-time probe with an intelligent decision analyzer. The probe measures cross-layer anomaly metrics using a lightweight distributed tracing framework to monitor communication traffic. The analyzer performs automated anomaly detection and root-cause localization, precisely identifying the faulty GPU rank. Deployed on a 4,000-GPU cluster over one year, CCL-D achieved near-complete coverage of known slow/hang anomalies and pinpointed affected ranks within 6 minutes, substantially outperforming existing solutions.
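As a rough illustration of what rank-level localization can look like, the sketch below flags ranks that have entered fewer collectives than their peers once every rank has stopped making progress. It is not CCL-D's algorithm, which the abstract does not specify. `RankSample`, its fields, `find_lagging_ranks`, and the 300-second timeout are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RankSample:
    """One probe reading for a single rank (hypothetical schema, not from the paper)."""
    rank: int
    last_started_seq: int            # highest collective sequence number this rank has entered
    seconds_since_progress: float    # time since this rank last started or finished a collective


def find_lagging_ranks(samples: List[RankSample], hang_timeout_s: float = 300.0) -> List[int]:
    """Return ranks that lag behind the furthest-ahead rank while the whole job is stalled.

    Assumption: in a hang, ranks that already entered collective N are blocked
    waiting for peers that have not reached N, so the peers with a lower
    sequence number are the first suspects.
    """
    if not samples:
        return []

    # Treat the job as hung only if every rank has been idle past the timeout.
    if min(s.seconds_since_progress for s in samples) < hang_timeout_s:
        return []

    front = max(s.last_started_seq for s in samples)
    # If all ranks sit at the same sequence number, this heuristic cannot
    # separate them and returns an empty list.
    return sorted(s.rank for s in samples if s.last_started_seq < front)


if __name__ == "__main__":
    readings = [
        RankSample(rank=0, last_started_seq=42, seconds_since_progress=400.0),
        RankSample(rank=1, last_started_seq=42, seconds_since_progress=410.0),
        RankSample(rank=2, last_started_seq=41, seconds_since_progress=405.0),
        RankSample(rank=3, last_started_seq=42, seconds_since_progress=400.0),
    ]
    print(find_lagging_ranks(readings))
```

A sequence-number comparison like this only narrows a hang to ranks that never reached the stalled collective. Slow-rank detection and cross-layer metrics, which the abstract also mentions, would need timing data that this sketch does not model.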
Problem

Research questions and friction points this paper is trying to address.

slow/hang anomalies
collective communication libraries
large-scale model training
diagnostic system
distributed training
Innovation

Methods, ideas, or system contributions that make the work stand out.

CCL-D
slow/hang anomaly
distributed tracing
root-cause localization
collective communication
🔎 Similar Papers
2024-06-07, International Symposium on High-Performance Computer Architecture, Citations: 5
Yida Gu
University of Chinese Academy of Sciences
Fakang Wang
Ant Group
Jianhao Fu
Ant Group
Zhenhang Sun
Ant Group
Qianyu Zhang
Ant Group
Hairui Zhao
Jilin University
Xingchen Liu
University of Chinese Academy of Sciences
Yang Tian
Ant Group
Wenjing Huang
RAND Corporation
Psychometrics, Structural Equation Modeling, Item Response Theory, Cyber Security
Zedong Liu
University of Chinese Academy of Sciences
Yifan Chen
Ant Group
Jinwu Yang
University of Chinese Academy of Sciences
Yueyuan Zhou
University of Chinese Academy of Sciences
Qian Zhao
Ant Group
Haoxu Li
University of Chinese Academy of Sciences
Tao Wang
Ant Group
Feng Yu
University of Exeter
Efficient AI, Continual Learning, Federated Learning, Foundation Model
Zhan Wang
University of Chinese Academy of Sciences
Guangming Tan
University of Chinese Academy of Sciences
Dingwen Tao
Chinese Academy of Sciences, IEEE/ACM Senior Member
High Performance Computing, Data Reduction, Deep Learning, Systems for ML, GPU