CCCL: Node-Spanning GPU Collectives with CXL Memory Pooling

📅 2026-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses GPU memory and interconnect bandwidth limitations in multi-node large language model (LLM) training and inference, where conventional RDMA-based communication incurs high overhead. The authors propose, for the first time, a cross-node GPU collective communication library built on CXL-based shared memory pools, replacing RDMA with a memory-centric architecture that tackles key issues of synchronization, data interleaving, and communication parallelization. Their approach achieves speedups of 1.34×, 1.84×, 1.94×, and 1.04× on AllGather, Broadcast, Gather, and Scatter operations, respectively, delivers an end-to-end 1.11× acceleration in LLM training, and reduces hardware costs by up to 2.75×.
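The memory-centric idea can be illustrated with a minimal sketch: instead of exchanging messages over a network, every rank writes its chunk into a shared pool and reads peers' chunks after a flag-based synchronization step. The code below is a hypothetical, thread-based illustration of an AllGather over shared memory (`shared_allgather` and all names are ours, not the paper's CCCL API), standing in for what a CXL memory pool would provide across hosts.

```python
# Illustrative sketch only -- NOT the paper's CCCL implementation.
# Threads stand in for ranks; a bytearray stands in for the CXL pool.
import threading

def shared_allgather(world_size, chunk_len, local_chunks):
    # Shared pool: one fixed-size slot per rank, plus a ready flag per rank.
    pool = bytearray(world_size * chunk_len)
    ready = [threading.Event() for _ in range(world_size)]
    results = [None] * world_size

    def rank_fn(rank):
        # Step 1: each rank writes its chunk directly into the shared pool
        # at its own offset (no network transfer, just a memory copy).
        off = rank * chunk_len
        pool[off:off + chunk_len] = local_chunks[rank]
        ready[rank].set()
        # Step 2: synchronization -- wait until every peer has published.
        for ev in ready:
            ev.wait()
        # Step 3: read the fully gathered buffer out of shared memory.
        results[rank] = bytes(pool)

    threads = [threading.Thread(target=rank_fn, args=(r,))
               for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

A real CXL-based design must additionally handle cross-host cache coherence, data interleaving across memory cards, and overlapping copies with synchronization, which is where the paper's contributions lie.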

📝 Abstract
Training or inference of large language models (LLMs) across multiple nodes puts significant pressure on GPU memory and interconnect bandwidth. The Compute Express Link (CXL) shared memory pool offers a scalable solution by enabling memory sharing across nodes, reducing over-provisioning and improving resource utilization. We propose CCCL, a collective communication library that leverages the CXL shared memory pool to support cross-node GPU operations without relying on traditional RDMA-based networking. Our design addresses the challenges of synchronization, data interleaving, and communication parallelization that arise when using the CXL shared memory pool for collective communications. Evaluating on multiple nodes with a TITAN-II CXL switch and six Micron CZ120 memory cards, we show that CCCL achieves highly efficient collective operations across hosts, demonstrating CXL's potential for scalable, memory-centric GPU communication. Our evaluation demonstrates that CCCL achieves average performance improvements of 1.34× for AllGather, 1.84× for Broadcast, 1.94× for Gather, and 1.04× for Scatter over the original RDMA-based implementation on 200 Gbps InfiniBand. In addition, an LLM training case study shows a 1.11× speedup compared with InfiniBand while reducing hardware production cost by 2.75×.
Problem

Research questions and friction points this paper is trying to address.

LLM training
GPU memory pressure
interconnect bandwidth
multi-node communication
memory over-provisioning
Innovation

Methods, ideas, or system contributions that make the work stand out.

CXL memory pooling
GPU collectives
cross-node communication
memory-centric architecture
collective communication library
🔎 Similar Papers
2024-06-07 · International Symposium on High-Performance Computer Architecture · Citations: 5