CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training

📅 2026-05-06

📈 Citations: 0

✨ Influential: 0


📝 Abstract

As training scales grow, collective communication libraries (CCLs) increasingly face anomalies arising from complex interactions among hardware, software, and environmental factors. These anomalies typically manifest as slow or hung communication, the most frequent and time-consuming category to diagnose. However, traditional diagnostic methods remain inaccurate and inefficient, frequently requiring hours or even days for root-cause analysis. To address this, we propose CCL-D, a high-precision diagnostic system designed to detect and locate slow/hang anomalies in large-scale distributed training. CCL-D integrates a rank-level real-time probe with an intelligent decision analyzer. The probe uses a lightweight distributed tracing framework to monitor communication traffic and measure cross-layer anomaly metrics. The analyzer performs automated anomaly detection and root-cause localization, precisely identifying the faulty GPU rank. Deployed on a 4,000-GPU cluster over one year, CCL-D achieved near-complete coverage of known slow/hang anomalies and pinpointed affected ranks within 6 minutes, substantially outperforming existing solutions.
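The abstract does not spell out the analyzer's logic, so the following is only a minimal sketch of the general idea of rank-level localization, not CCL-D's actual algorithm. It assumes, hypothetically, that each rank's probe reports the sequence number of the last collective it entered and the time at which it entered a given collective; the names `RankSample`, `find_hang_suspects`, and `find_straggler` are illustrative and do not come from the paper.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List, Optional


@dataclass
class RankSample:
    """Hypothetical per-rank probe record (not a CCL-D data structure)."""
    rank: int
    last_started: int    # sequence number of the last collective the rank entered
    last_completed: int  # sequence number of the last collective the rank finished


def find_hang_suspects(samples: List[RankSample]) -> List[int]:
    """Return ranks that never entered the collective the others are blocked in.

    Assumption: during a hang, healthy ranks have entered collective N and are
    waiting, while a faulty rank is still behind at some sequence number < N.
    """
    front = max(s.last_started for s in samples)
    return sorted(s.rank for s in samples if s.last_started < front)


def find_straggler(entry_ms: Dict[int, float], tolerance_ms: float) -> Optional[int]:
    """Return the rank that entered one collective much later than the median.

    Assumption: entry_ms maps rank -> entry timestamp (ms) for the same
    collective, taken from roughly synchronized clocks.
    """
    latest_rank = max(entry_ms, key=entry_ms.get)
    lag = entry_ms[latest_rank] - median(entry_ms.values())
    return latest_rank if lag > tolerance_ms else None


if __name__ == "__main__":
    samples = [
        RankSample(0, 42, 41),
        RankSample(1, 42, 41),
        RankSample(2, 41, 41),  # never entered collective 42
        RankSample(3, 42, 41),
    ]
    print(find_hang_suspects(samples))
    print(find_straggler({0: 100.0, 1: 101.0, 2: 350.0, 3: 99.0}, tolerance_ms=50.0))
```

Comparing sequence numbers and entry times, rather than per-rank latencies, reflects the fact that a collective finishes only when its slowest participant does, so every rank tends to observe the same inflated latency. A production system would also need the cross-layer metrics the abstract mentions to separate, for example, a late rank from a degraded link.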

Problem

Research questions and friction points this paper is trying to address.

slow/hang anomalies

collective communication libraries

large-scale model training

diagnostic system

distributed training

Innovation

Methods, ideas, or system contributions that make the work stand out.

CCL-D

slow/hang anomaly

distributed tracing