An Efficient, Reliable and Observable Collective Communication Library in Large-scale GPU Training Clusters

📅 2025-10-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper proposes ICCL, an efficient, reliable, and highly observable communication library, to address three critical limitations of NCCL in large-scale GPU training: inefficient peer-to-peer (P2P) communication, poor fault tolerance against RoCE NIC (RNIC) port failures, and difficulty in observing transient collective communication anomalies. ICCL's key innovations are: (1) offloading P2P communication from GPU kernels to CPU threads to free GPU SM resources; (2) introducing a primary-backup queue pair (QP) mechanism that enables millisecond-level RNIC failover; and (3) providing microsecond-granularity sliding-window network monitoring for precise detection of transient anomalies. Experiments show that ICCL improves P2P throughput by 23.4% and reduces P2P latency by 28.5% over NCCL, yielding a 6.02% end-to-end training throughput gain. ICCL has operated stably in production for several months and is open source.
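
For intuition on innovation (1), a minimal sketch of the CPU-offload idea follows, assuming a dedicated proxy thread that drains a request queue and posts RDMA writes directly from pre-registered GPU buffers, so no GPU SMs are spent driving communication. The P2PSendRequest struct, the mutex-protected queue, and the loop structure are illustrative assumptions rather than ICCL's actual interfaces; only the ibverbs structures and the ibv_post_send call are standard.

```cpp
// Illustrative sketch (not ICCL's code): a CPU proxy thread drives P2P sends.
// The GPU never runs a communication kernel; the CPU posts one-sided RDMA
// writes from buffers that were registered with the RNIC ahead of time.
#include <infiniband/verbs.h>
#include <atomic>
#include <cstdint>
#include <mutex>
#include <queue>

struct P2PSendRequest {      // hypothetical request descriptor
    uint64_t laddr;          // local (GPU) buffer address, already registered
    uint32_t lkey;           // local memory region key
    uint64_t raddr;          // remote buffer address
    uint32_t rkey;           // remote memory region key
    uint32_t bytes;          // payload size
};

void proxy_loop(ibv_qp* qp, std::queue<P2PSendRequest>& reqs,
                std::mutex& mtx, std::atomic<bool>& stop) {
    while (!stop.load(std::memory_order_relaxed)) {
        P2PSendRequest r;
        {
            std::lock_guard<std::mutex> g(mtx);
            if (reqs.empty()) continue;   // real code would poll or sleep here
            r = reqs.front();
            reqs.pop();
        }
        ibv_sge sge{};                    // scatter/gather entry for the GPU buffer
        sge.addr   = r.laddr;
        sge.length = r.bytes;
        sge.lkey   = r.lkey;

        ibv_send_wr wr{};                 // one-sided RDMA write work request
        wr.opcode              = IBV_WR_RDMA_WRITE;
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.send_flags          = IBV_SEND_SIGNALED;
        wr.wr.rdma.remote_addr = r.raddr;
        wr.wr.rdma.rkey        = r.rkey;

        ibv_send_wr* bad = nullptr;
        ibv_post_send(qp, &wr, &bad);     // the CPU, not the GPU, posts the transfer
    }
}
```

A production proxy would likely use a lock-free ring and batched postings rather than a mutex-protected queue, but the division of labor (GPU computes, CPU drives verbs) is the point of the sketch.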

📝 Abstract
Large-scale LLM training requires collective communication libraries to exchange data among distributed GPUs. As a company dedicated to building and operating large-scale GPU training clusters, we encounter several challenges when using NCCL in production, including 1) limited efficiency due to costly and cumbersome P2P communication, 2) poor tolerance to frequent RNIC port failures, and 3) insufficient observability of transient collective communication anomalies. To address these issues, we propose ICCL, an efficient, reliable, and observable collective communication library for large-scale GPU training clusters. ICCL offloads P2P communication from GPU kernels to CPU threads to minimize SM consumption, and removes redundant memory copies that are irrelevant to the actual communication process. ICCL also introduces a primary-backup QP mechanism to tolerate frequent NIC port failures, and designs a window-based monitor to observe network anomalies at O(µs) granularity. We open-source ICCL and have deployed it in production training clusters for several months; compared to NCCL, ICCL achieves a 23.4%/28.5% improvement in P2P throughput/latency as well as a 6.02% increase in training throughput. We also share our experience operating ICCL in large-scale clusters, hoping to give the community more insights into production-level collective communication libraries for LLM training.
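
As a rough illustration of the primary-backup QP mechanism mentioned in the abstract, the sketch below pre-establishes a backup QP on a second RNIC port and switches to it when a completion returns with an error status. FailoverChannel and repost_inflight are hypothetical names introduced here for illustration, and the recovery path is heavily simplified; only the ibverbs polling API (ibv_poll_cq, ibv_wc) is standard, and the paper describes how ICCL actually achieves millisecond-level failover.

```cpp
// Illustrative primary-backup QP failover (not ICCL's code).
#include <infiniband/verbs.h>
#include <cstdio>

struct FailoverChannel {     // hypothetical per-peer channel state
    ibv_qp* primary;         // QP bound to the primary RNIC port
    ibv_qp* backup;          // pre-connected QP bound to the backup port
    ibv_cq* cq;              // completion queue shared by both QPs
    bool    on_backup = false;
};

// Placeholder: a real implementation would track outstanding work requests
// by wr_id and re-post the unacknowledged ones on the backup QP.
void repost_inflight(ibv_qp* backup_qp) { (void)backup_qp; }

void poll_and_failover(FailoverChannel& ch) {
    ibv_wc wc;
    int n = ibv_poll_cq(ch.cq, 1, &wc);        // poll one completion
    if (n <= 0) return;                        // nothing completed yet
    if (wc.status == IBV_WC_SUCCESS) return;   // normal completion

    // A completion error (e.g., retries exhausted after a port failure):
    // switch to the pre-established backup QP and replay outstanding work
    // instead of aborting the whole training job.
    if (!ch.on_backup) {
        std::fprintf(stderr, "QP error (status=%d), failing over to backup QP\n",
                     static_cast<int>(wc.status));
        ch.on_backup = true;
        repost_inflight(ch.backup);
    }
}
```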
Problem

Research questions and friction points this paper is trying to address.

Optimizing P2P communication efficiency in GPU clusters
Enhancing fault tolerance for frequent NIC port failures
Improving observability of transient network anomalies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Offloads P2P communication from GPU kernels to CPU threads
Introduces a primary-backup QP mechanism to tolerate NIC port failures
Designs a window-based monitor for transient network anomalies (see the sketch below)
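
A window-based monitor of the kind listed above can be pictured, under assumptions, as bucketing per-transfer completion records into microsecond-scale windows and flagging windows whose aggregate traffic collapses. The window length, threshold, and the CommRecord/WindowMonitor names below are illustrative choices, not ICCL's actual design or parameters.

```cpp
// Illustrative sliding-window anomaly check at microsecond granularity.
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <deque>

using Clock = std::chrono::steady_clock;

struct CommRecord {                       // hypothetical per-completion record
    Clock::time_point ts;                 // completion timestamp
    uint64_t bytes;                       // payload size of the transfer
};

class WindowMonitor {
public:
    // window_us: window length; min_bytes: traffic floor below which the
    // current window is reported as a transient anomaly candidate.
    WindowMonitor(uint64_t window_us, uint64_t min_bytes)
        : window_(std::chrono::microseconds(window_us)), min_bytes_(min_bytes) {}

    void record(uint64_t bytes) {         // call from the completion path
        records_.push_back({Clock::now(), bytes});
    }

    bool window_is_anomalous() {
        auto cutoff = Clock::now() - window_;
        while (!records_.empty() && records_.front().ts < cutoff)
            records_.pop_front();         // drop records older than one window
        uint64_t total = 0;
        for (const auto& r : records_) total += r.bytes;
        return total < min_bytes_;
    }

private:
    std::chrono::microseconds window_;
    uint64_t min_bytes_;
    std::deque<CommRecord> records_;
};

int main() {
    WindowMonitor mon(/*window_us=*/100, /*min_bytes=*/1 << 20);  // assumed values
    mon.record(4096);
    if (mon.window_is_anomalous())
        std::printf("transient throughput drop in the current window\n");
    return 0;
}
```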
Authors

Ziteng Chen, Infrawaves
Xiaohe Hu, Tsinghua University (machine learning, system and architecture)
Menghao Zhang, Beihang University
Yanmin Jia, Infrawaves
Yan Zhang, Infrawaves
Mingjun Zhang, Infrawaves
Da Liu, North China Electric Power University (Energy Supply Chain Management)
Fangzheng Jiao, Beihang University
Jun Chen, Infrawaves
He Liu, Infrawaves
Aohan Zeng, Tsinghua University (Large Language Models, Natural Language Processing)
Shuaixing Duan, Zhipu AI
Ruya Gu, Infrawaves
Yang Jing, Infrawaves
Bowen Han, China Unicom Research Institute
Jiahao Cao, Tsinghua University (Network Traffic Analysis, Network Protocol Security)
Wei Chen, Infrawaves
Wenqi Xie, Infrawaves
Jinlong Hou, Shanghai Innovation Institute (SII) (machine learning, deep learning, high performance computing, drug discovery, medical)
Yuan Cheng, Shanghai Innovation Institute
Bohua Xu, China Unicom Research Institute
Mingwei Xu, Computer Science, Tsinghua University (Internet architecture)
Chunming Hu, Beihang University