HetCCL: Accelerating LLM Training with Heterogeneous GPUs

πŸ“… 2026-01-30
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing deep learning frameworks lack efficient collective communication support across heterogeneous GPUs from vendors such as NVIDIA and AMD, which hinders large-model training efficiency. To overcome this limitation, we propose HetCCL, the first library that enables high-performance cross-vendor GPU collective communication without requiring modifications to drivers or applications. HetCCL unifies the NCCL and RCCL backends and constructs cross-vendor communication paths over RDMA, while introducing two novel mechanisms to ensure compatibility with mainstream deep learning frameworks. Experimental results demonstrate that HetCCL matches the performance of native communication libraries in homogeneous environments and significantly improves scalability and training efficiency for large models in heterogeneous GPU clusters.

πŸ“ Abstract
The rapid growth of large language models is driving organizations to expand their GPU clusters, often with GPUs from multiple vendors. However, current deep learning frameworks lack support for collective communication across heterogeneous GPUs, leading to inefficiency and higher costs. We present HetCCL, a collective communication library that unifies vendor-specific backends and enables RDMA-based communication across GPUs without requiring driver modifications. HetCCL introduces two novel mechanisms that enable cross-vendor communication while leveraging optimized vendor libraries, NVIDIA NCCL and AMD RCCL. Evaluations on a multi-vendor GPU cluster show that HetCCL matches NCCL and RCCL performance in homogeneous setups while uniquely scaling in heterogeneous environments, enabling practical, high-performance training with both NVIDIA and AMD GPUs without changes to existing deep learning applications.
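The abstract describes HetCCL as a single front end that delegates intra-vendor collectives to the optimized vendor libraries (NCCL on NVIDIA, RCCL on AMD) and bridges the two vendor groups over RDMA. The paper does not spell out the algorithm, but one common way to realize such a design is a hierarchical allreduce: reduce within each vendor group, exchange partials between group leaders, then broadcast back. The sketch below models only that pattern; the function names and stub bodies are hypothetical and stand in for the vendor collectives and the RDMA path, not HetCCL's actual API.

```python
def intra_vendor_allreduce(chunks):
    """Stand-in for a vendor collective (NCCL or RCCL):
    sum-allreduce within one vendor's GPU group."""
    total = sum(chunks)
    return [total] * len(chunks)

def cross_vendor_exchange(partial_a, partial_b):
    """Stand-in for the RDMA path between vendor groups:
    the two group leaders exchange and combine partial sums."""
    combined = partial_a + partial_b
    return combined, combined

def het_allreduce(nvidia_vals, amd_vals):
    """Hypothetical hierarchical allreduce across two vendor groups:
    1) reduce within each group via the vendor library,
    2) exchange partials between group leaders over RDMA,
    3) broadcast the combined result within each group."""
    nv = intra_vendor_allreduce(nvidia_vals)    # step 1 (NCCL side)
    amd = intra_vendor_allreduce(amd_vals)      # step 1 (RCCL side)
    nv_total, amd_total = cross_vendor_exchange(nv[0], amd[0])  # step 2
    return [nv_total] * len(nvidia_vals), [amd_total] * len(amd_vals)  # step 3
```

For example, `het_allreduce([1, 2], [3, 4])` returns `([10, 10], [10, 10])`: every rank on both vendors ends with the global sum, while only the leaders cross the vendor boundary, which keeps the expensive cross-vendor traffic to a single exchange.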
Problem

Research questions and friction points this paper is trying to address.

heterogeneous GPUs
collective communication
LLM training
multi-vendor GPU cluster
Innovation

Methods, ideas, or system contributions that make the work stand out.

heterogeneous GPUs
collective communication
RDMA
LLM training
cross-vendor
πŸ”Ž Similar Papers
No similar papers found.
Heehoon Kim
Moreh Inc., Seoul, South Korea
Jaehwan Lee
Korea Aerospace University
System SW, Distributed Computing, Cloud Computing, Big Data, AI
Taejeoung Kim
Samsung Research, Seoul, South Korea
Jongwon Park
KAIST, Korea Atomic Energy Research Institute (KAERI)
robot control, nuclear disaster
Jinpyo Kim
Dept. of Computer Science and Engineering, Seoul National University, Seoul, South Korea
Pyongwon Suh
Samsung Research, Seoul, South Korea
Ryan H. Choi
Samsung Research, Seoul, South Korea
Sangwoo Lee
Samsung Research, Seoul, South Korea
Jaejin Lee
Dept. of Computer Science and Engineering, Seoul National University
Parallel processing, Compilers, Computer architectures, Operating systems, Heterogeneous computing