NCCLZ: Compression-Enabled GPU Collectives with Decoupled Quantization and Entropy Coding

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the bandwidth-bound bottleneck of collective communication in multi-node GPU workloads, which limits both scientific computing and distributed deep learning. NCCLZ decouples quantization from entropy coding inside the native NCCL stack: quantization is applied at the interface layer, while entropy coding is embedded directly into NCCL primitives. A lightweight device-side selector chooses the coding strategy at run time, and compression is overlapped with communication to hide its overhead. Separating the two stages improves compression ratio and flexibility compared with designs that couple a full compressor to each communication primitive. Evaluated on scientific datasets, training gradients, and synthetic workloads, the method achieves up to 9.65× speedup over native NCCL and up to 3.34× improvement over existing compression-assisted collective libraries.

📝 Abstract

Collective communication is a major bottleneck for multi-node GPU workloads in scientific computing and distributed deep learning, especially when inter-node bandwidth is limited. Although NCCL provides optimized GPU-centric collectives, large messages can still dominate end-to-end performance. Existing compression-enabled collective libraries rely on MPI-based stacks that cannot fully exploit NCCL, omit entropy coding, or tightly couple full compressors with communication primitives, which limits compression ratio, flexibility, and communication-computation overlap. This paper presents NCCLZ, a compression-enabled GPU collective library that decouples quantization from entropy coding and integrates them at different layers of the stack. NCCLZ places quantization at the interface, embeds entropy coding into NCCL primitives, uses a lightweight device-side selector to choose coding strategies, and overlaps compression with communication to reduce exposed overhead. Experiments on scientific datasets, training gradients, and synthetic workloads show up to 9.65× speedup over NCCL and up to 3.34× improvement over prior compression-assisted collective libraries.
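
To make the decoupling concrete, the following is a minimal CPU-only sketch of the idea, not NCCLZ's implementation. It assumes error-bounded linear quantization as the interface-level stage, uses `zlib` as a stand-in for the entropy coder, and picks between "entropy-code" and "send raw" with a simple Shannon-entropy threshold. All function names, the threshold value, and the choice of `zlib` are hypothetical; the abstract does not specify the actual quantizer, coders, or selector, and no NCCL calls are made here.

```python
import math
import struct
import zlib
from collections import Counter


def quantize(values, error_bound):
    """Stage 1 (interface level in this sketch): error-bounded linear quantization.

    Each value maps to an integer code so that the reconstruction error
    is at most `error_bound` (up to floating-point rounding).
    """
    step = 2.0 * error_bound
    return [round(v / step) for v in values]


def dequantize(codes, error_bound):
    step = 2.0 * error_bound
    return [q * step for q in codes]


def shannon_entropy_bits(codes):
    """Empirical entropy of the code stream, in bits per symbol."""
    n = len(codes)
    counts = Counter(codes)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def select_coder(codes, threshold_bits=8.0):
    """Hypothetical selector: entropy-code only when the stream looks compressible."""
    return "zlib" if shannon_entropy_bits(codes) < threshold_bits else "raw"


def pack(codes):
    # Little-endian int32 payload; assumes every code fits in 32 bits.
    return struct.pack(f"<{len(codes)}i", *codes)


def unpack(payload):
    return list(struct.unpack(f"<{len(payload) // 4}i", payload))


def encode(codes):
    """Stage 2 (inside the communication primitive in this sketch)."""
    coder = select_coder(codes)
    raw = pack(codes)
    body = zlib.compress(raw) if coder == "zlib" else raw
    return coder, body


def decode(coder, body):
    raw = zlib.decompress(body) if coder == "zlib" else body
    return unpack(raw)


if __name__ == "__main__":
    eb = 1e-2
    data = [math.sin(0.01 * i) for i in range(4096)]
    codes = quantize(data, eb)          # done once, before the collective
    coder, body = encode(codes)         # done per message, alongside the send
    restored = dequantize(decode(coder, body), eb)
    print(coder, len(pack(codes)), "->", len(body), "bytes")
```

Because the two stages only share an integer code stream, the quantizer and the entropy coder can be swapped independently, which is the flexibility argument the abstract makes for decoupling.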

Problem

Research questions and friction points this paper is trying to address.

collective communication

GPU

compression

NCCL

entropy coding

Innovation

Methods, ideas, or system contributions that make the work stand out.

decoupled quantization

entropy coding

GPU collectives