CCCL: Node-Spanning GPU Collectives with CXL Memory Pooling

📅 2026-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses GPU memory and interconnect bandwidth limitations in multi-node large language model (LLM) training and inference, where conventional RDMA-based communication incurs high overhead. The authors propose, for the first time, a cross-node GPU collective communication library built on CXL-based shared memory pools, replacing RDMA with a memory-centric architecture that tackles key issues of synchronization, data interleaving, and communication parallelization. Their approach achieves speedups of 1.34×, 1.84×, 1.94×, and 1.04× on AllGather, Broadcast, Gather, and Scatter operations, respectively, delivers an end-to-end 1.11× acceleration in LLM training, and reduces hardware costs by up to 2.75×.
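The memory-centric idea can be illustrated with a minimal sketch: instead of exchanging messages over a network, every rank writes its chunk into a shared pool and reads peers' chunks after a flag-based synchronization step. The code below is a hypothetical, thread-based illustration of an AllGather over shared memory (`shared_allgather` and all names are ours, not the paper's CCCL API), standing in for what a CXL memory pool would provide across hosts.

```python
# Illustrative sketch only -- NOT the paper's CCCL implementation.
# Threads stand in for ranks; a bytearray stands in for the CXL pool.
import threading

def shared_allgather(world_size, chunk_len, local_chunks):
    # Shared pool: one fixed-size slot per rank, plus a ready flag per rank.
    pool = bytearray(world_size * chunk_len)
    ready = [threading.Event() for _ in range(world_size)]
    results = [None] * world_size

    def rank_fn(rank):
        # Step 1: each rank writes its chunk directly into the shared pool
        # at its own offset (no network transfer, just a memory copy).
        off = rank * chunk_len
        pool[off:off + chunk_len] = local_chunks[rank]
        ready[rank].set()
        # Step 2: synchronization -- wait until every peer has published.
        for ev in ready:
            ev.wait()
        # Step 3: read the fully gathered buffer out of shared memory.
        results[rank] = bytes(pool)

    threads = [threading.Thread(target=rank_fn, args=(r,))
               for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

A real CXL-based design must additionally handle cross-host cache coherence, data interleaving across memory cards, and overlapping copies with synchronization, which is where the paper's contributions lie.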

📝 Abstract
Training or inference of large language models (LLMs) across multiple nodes puts significant pressure on GPU memory and interconnect bandwidth. The Compute Express Link (CXL) shared memory pool offers a scalable solution by enabling memory sharing across nodes, reducing over-provisioning and improving resource utilization. We propose CCCL, a collective communication library that leverages the CXL shared memory pool to support cross-node GPU operations without relying on traditional RDMA-based networking. Our design addresses the challenges of synchronization, data interleaving, and communication parallelization that arise when using the CXL shared memory pool for collective communications. Evaluating on multiple nodes with a TITAN-II CXL switch and six Micron CZ120 memory cards, we show that CCCL achieves highly efficient collective operations across hosts, demonstrating CXL's potential for scalable, memory-centric GPU communication. Our evaluation demonstrates that CCCL achieves average performance improvements of 1.34× for AllGather, 1.84× for Broadcast, 1.94× for Gather, and 1.04× for Scatter over the original RDMA-based implementation on 200 Gbps InfiniBand. In addition, an LLM training case study shows a 1.11× speedup compared with InfiniBand while reducing hardware production cost by 2.75×.
Problem

Research questions and friction points this paper is trying to address.

LLM training
GPU memory pressure
interconnect bandwidth
multi-node communication
memory over-provisioning
Innovation

Methods, ideas, or system contributions that make the work stand out.

CXL memory pooling
GPU collectives
cross-node communication
memory-centric architecture
collective communication library
🔎 Similar Papers
2024-06-07 · International Symposium on High-Performance Computer Architecture · Citations: 5