ZipCCL: Efficient Lossless Data Compression of Communication Collectives for Accelerating LLM Training

📅 2026-04-30

🤖 AI Summary
This work addresses the communication bottleneck in distributed training of large language models, where lossless compression has seen little use because compression and decompression typically cost more than the traffic they save. The authors propose ZipCCL, a lossless compressed collective communication library that exploits the near-Gaussian distribution of training tensors (activations, gradients, and parameters) through an exponent coding scheme that needs no expensive online statistics. The library combines this coding with GPU-optimized compression and decompression kernels built around a communication-aware data layout, and with an adaptive strategy that dynamically switches collective operations based on workload patterns and system characteristics. Evaluated on a 64-GPU cluster with both mixture-of-experts and dense transformer models, ZipCCL reduces communication time by up to 1.35× and speeds up end-to-end training by up to 1.18×, with no impact on model quality.
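The trade-off behind "compression costs more than it saves" can be made concrete with a back-of-the-envelope cost model. The sketch below is illustrative only and is not taken from the paper: the function names, the assumption that compression, transfer, and decompression run strictly one after another with no overlap, and all numeric values are hypothetical.

```python
def raw_transfer_time(size_bytes: float, link_bytes_per_s: float) -> float:
    """Seconds needed to send the uncompressed payload over the link."""
    return size_bytes / link_bytes_per_s


def compressed_transfer_time(
    size_bytes: float,
    link_bytes_per_s: float,
    ratio: float,
    comp_bytes_per_s: float,
    decomp_bytes_per_s: float,
) -> float:
    """Seconds to compress, send the smaller payload, and decompress.

    Assumes the three stages do not overlap (a pessimistic simplification).
    `ratio` is original size divided by compressed size.
    """
    compress = size_bytes / comp_bytes_per_s
    send = (size_bytes / ratio) / link_bytes_per_s
    decompress = size_bytes / decomp_bytes_per_s
    return compress + send + decompress


def compression_pays_off(size_bytes, link_bytes_per_s, ratio,
                         comp_bytes_per_s, decomp_bytes_per_s) -> bool:
    """True when the compressed path is faster than sending raw bytes."""
    return compressed_transfer_time(
        size_bytes, link_bytes_per_s, ratio, comp_bytes_per_s, decomp_bytes_per_s
    ) < raw_transfer_time(size_bytes, link_bytes_per_s)


if __name__ == "__main__":
    GB = 1e9
    # Hypothetical numbers: 1 GB payload, 25 GB/s link, 1.5x compression ratio.
    fast_codec = compression_pays_off(1 * GB, 25 * GB, 1.5, 500 * GB, 500 * GB)
    slow_codec = compression_pays_off(1 * GB, 25 * GB, 1.5, 10 * GB, 10 * GB)
    print("fast codec pays off:", fast_codec)
    print("slow codec pays off:", slow_codec)
```

Under this simplified model, a codec only helps when its throughput is far above the link bandwidth, which is consistent with the summary's emphasis on cheap, GPU-resident encoding and decoding.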
📝 Abstract
Communication has emerged as a critical bottleneck in the distributed training of large language models (LLMs). While numerous approaches have been proposed to reduce communication overhead, the potential of lossless compression has remained largely underexplored, since compression and decompression typically incur more overhead than they save in reduced communication traffic. We observe that the communication data, including activations, gradients and parameters, during training often follows a near-Gaussian distribution, which is a key feature for data compression. Thus, we introduce ZipCCL, a lossless compressed communication library of collectives for LLM training. ZipCCL is equipped with our novel techniques: (1) theoretically grounded exponent coding that exploits the Gaussian distribution of LLM tensors to accelerate compression without expensive online statistics, (2) GPU-optimized compression and decompression kernels with carefully designed memory access patterns and pipelining based on a communication-aware data layout, and (3) adaptive communication strategies that dynamically switch collective operations based on workload patterns and system characteristics. Evaluated on a 64-GPU cluster using both mixture-of-experts and dense transformer models, ZipCCL reduces communication time by up to 1.35× and achieves end-to-end training speedups of up to 1.18× without any impact on model quality.
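To see why a near-Gaussian distribution makes the exponent field a natural compression target, the following sketch measures the empirical Shannon entropy of the 8-bit exponent of Gaussian samples. It is a rough illustration under stated assumptions, not ZipCCL's coding scheme: the bfloat16 format, the standard deviation of 0.02, the sample count, and all function names are hypothetical choices made here, and mantissa rounding during bfloat16 conversion is ignored.

```python
import math
import random
import struct
from collections import Counter


def bf16_exponent(x: float) -> int:
    """Return the 8-bit exponent field of x as it would appear in bfloat16.

    bfloat16 shares float32's sign and exponent layout, so the field is read
    from the float32 bit pattern. Mantissa rounding is ignored in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return (bits >> 23) & 0xFF


def shannon_entropy_bits(symbols) -> float:
    """Empirical Shannon entropy, in bits per symbol, of a symbol sequence."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def exponent_entropy(n: int = 100_000, sigma: float = 0.02, seed: int = 0) -> float:
    """Entropy of the exponent field for n zero-mean Gaussian samples."""
    rng = random.Random(seed)
    exponents = [bf16_exponent(rng.gauss(0.0, sigma)) for _ in range(n)]
    return shannon_entropy_bits(exponents)


if __name__ == "__main__":
    h = exponent_entropy()
    print(f"exponent entropy: {h:.2f} bits (the field occupies 8 bits)")
```

Because Gaussian samples concentrate in a narrow band of magnitudes, only a handful of exponent values occur with meaningful probability, so the measured entropy comes out substantially below the 8 bits the field occupies. That gap is the redundancy a lossless exponent code can remove, and the fact that it follows from the distribution's shape is what makes a scheme without online statistics plausible.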
Problem

Research questions and friction points this paper is trying to address.

communication bottleneck
lossless compression
distributed training
large language models
communication overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

lossless compression
communication collectives
Gaussian distribution
GPU-optimized kernels
adaptive communication