🤖 AI Summary
This work addresses the significant overhead of collective communication in LLM workloads. Existing application-level techniques that overlap communication with computation require extensive code modifications and are impractical for parallelism strategies such as tensor and expert parallelism. To overcome these limitations, the authors propose CCCL, the first collective communication library that natively integrates compression without user intervention. CCCL compresses data on the GPU, tightly fuses its compression kernels to reduce memory accesses, and integrates with NCCL to support operations such as AllReduce, AllToAll, and send/recv while eliminating the data coalescing stage that conventional compression approaches require. Experimental results show up to a 10.1% improvement in end-to-end throughput on vLLM prefill-decode (PD) disaggregation workloads and up to 30% higher throughput in microbenchmarks.
📝 Abstract
Collective communication incurs significant overhead in LLM workloads. Although overlapping communication with computation at the application level is a common strategy, it often requires substantial code modifications and is impractical for many workloads (e.g., tensor and expert parallelism). We present CCCL, a collective communication library with built-in compression that supports operations such as allreduce, alltoall, and send/recv without requiring any user-side changes, thereby enabling seamless adoption in existing applications. CCCL tightly fuses its compression kernels to minimize memory accesses and integrates with NCCL to eliminate the data coalescing stage, making compression fast enough (up to 3x NVLink bandwidth) to keep pace with communication. Our evaluation shows that CCCL improves end-to-end throughput in vLLM prefill-decode (PD) disaggregation workloads by up to 10.1% and microbenchmark throughput by up to 30%.
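To illustrate why the compressor must outrun the link for compression to pay off, here is a minimal back-of-the-envelope sketch. It assumes a fully pipelined compress-then-transfer path in which the slower stage sets the rate, and it ignores decompression and kernel launch costs. The function names and all numbers are hypothetical illustrations, not CCCL's API or measurements from the paper.

```python
def effective_throughput(link_gbs: float, compress_gbs: float, ratio: float) -> float:
    """Uncompressed GB/s delivered by a pipelined compress -> transfer path.

    link_gbs:     raw link bandwidth in GB/s (hypothetical value)
    compress_gbs: compressor throughput in GB/s of uncompressed input
    ratio:        compression ratio (uncompressed size / compressed size)
    """
    if link_gbs <= 0 or compress_gbs <= 0 or ratio <= 0:
        raise ValueError("all arguments must be positive")
    # The link moves `ratio` uncompressed bytes for every byte on the wire.
    wire_limit = link_gbs * ratio
    # In a pipeline, the slower of the two stages bounds the throughput.
    return min(compress_gbs, wire_limit)


def speedup(link_gbs: float, compress_gbs: float, ratio: float) -> float:
    """Throughput relative to sending the data uncompressed over the same link."""
    return effective_throughput(link_gbs, compress_gbs, ratio) / link_gbs


if __name__ == "__main__":
    # Compressor at 3x link speed: the link is the bottleneck, gain equals the ratio.
    print(speedup(link_gbs=100.0, compress_gbs=300.0, ratio=1.5))  # 1.5
    # Compressor slower than the link: compression makes communication slower.
    print(speedup(link_gbs=100.0, compress_gbs=80.0, ratio=2.0))   # 0.8
```

Under this simplified model, compression only helps when the compressor's throughput exceeds the raw link bandwidth, which is consistent with the abstract's emphasis on reducing memory accesses and removing the coalescing stage.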