CCCL: In-GPU Compression-Coupled Collective Communication

📅 2026-04-18
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the significant overhead of collective communication in LLM workloads, where existing application-level compute-communication overlap techniques require extensive code modifications and are impractical for many workloads, such as those using tensor and expert parallelism. To overcome these limitations, the authors propose CCCL, a collective communication library with built-in compression that requires no user-side changes. CCCL compresses data on the GPU, tightly fuses its compression kernels to minimize memory accesses, and integrates with NCCL to support operations such as AllReduce, AllToAll, and send/recv while eliminating the data coalescing stage that conventional compression approaches incur. Experimental results show that CCCL improves end-to-end throughput by up to 10.1% in vLLM PD disaggregation workloads and microbenchmark throughput by up to 30%.

📝 Abstract
Collective communication incurs significant overhead in LLM workloads. Although overlapping communication with computation at the application level is a common strategy, it often requires substantial code modifications and is impractical for many workloads (e.g., tensor and expert parallelism). We present CCCL, a built-in compression-based collective communication library that supports operations such as allreduce, alltoall, and send/recv without requiring any user-side changes, thereby enabling seamless adoption in existing applications. CCCL tightly fuses compression kernels to minimize memory accesses and integrates with NCCL to eliminate the data coalescing stage, making it fast enough (up to 3x NVLink bandwidth) to sustain communication. Our evaluation shows that CCCL improves end-to-end throughput in vLLM PD disaggregation workloads by up to 10.1% and microbenchmark throughput by up to 30%.
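
The abstract's point that compression must be "fast enough" to sustain communication can be illustrated with a back-of-the-envelope model. The sketch below is not taken from the paper: it assumes an idealized, perfectly overlapped compress-and-send pipeline whose steady-state rate is set by its slowest stage, and the function names, parameters, and all numbers are hypothetical.

```python
def effective_throughput(link_gbps: float, ratio: float, codec_gbps: float) -> float:
    """Steady-state throughput, in GB/s of uncompressed data, of an idealized
    compress -> send pipeline (a simplifying assumption, not CCCL's design).

    link_gbps  : raw interconnect bandwidth in GB/s
    ratio      : compression ratio (uncompressed bytes / compressed bytes)
    codec_gbps : compression kernel throughput in GB/s of uncompressed input
    """
    if min(link_gbps, ratio, codec_gbps) <= 0:
        raise ValueError("all inputs must be positive")
    # With perfect overlap the slowest stage sets the pace. The link carries
    # compressed bytes, so it moves ratio * link_gbps of original data per second.
    return min(codec_gbps, ratio * link_gbps)


def speedup(link_gbps: float, ratio: float, codec_gbps: float) -> float:
    """Throughput relative to sending the data uncompressed over the same link."""
    return effective_throughput(link_gbps, ratio, codec_gbps) / link_gbps


# Hypothetical numbers: a 100 GB/s link and a 1.5x compression ratio.
fast_codec = speedup(100.0, 1.5, 300.0)  # codec runs at 3x the link bandwidth
slow_codec = speedup(100.0, 1.5, 80.0)   # codec is slower than the link itself
```

Under this model a codec running at three times the link bandwidth leaves the link as the bottleneck, so the full compression ratio translates into throughput, whereas a codec slower than the link makes compression a net loss. Real systems also pay for decompression and kernel launch overheads, which this sketch ignores.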
Problem

Research questions and friction points this paper is trying to address.

collective communication
LLM workloads
communication overhead
tensor parallelism
expert parallelism
Innovation

Methods, ideas, or system contributions that make the work stand out.

compression-coupled communication
in-GPU collective communication
NCCL integration
memory access optimization
zero-code-modification
🔎 Similar Papers
2024-06-07, International Symposium on High-Performance Computer Architecture, Citations: 5