FlashCommunication V2: Bit Splitting and Spike Reserving for Any Bit Communication

📅 2025-08-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the bandwidth bottleneck in cross-GPU communication during distributed training and inference of large language models (LLMs), this paper proposes the first efficient communication paradigm supporting arbitrary bit-width quantization, down to 2 bits. Methodologically, it integrates bit splitting (mapping non-native bit widths onto hardware-supported primitives) with spike reserving (explicit retention of extreme values to suppress quantization error) under a holistic hardware-software co-design framework compatible with both NVLink and PCIe interconnects. The approach optimizes both the AllReduce and All2All collective communication primitives. Experiments demonstrate up to 3.2× speedup for AllReduce and 2× for All2All, while keeping accuracy degradation acceptable. This significantly enhances communication flexibility, hardware resource utilization, and the practical limits of ultra-low-bit quantization in LLM distributed systems.

📝 Abstract
Nowadays, communication bottlenecks have emerged as a critical challenge in the distributed training and deployment of large language models (LLMs). This paper introduces FlashCommunication V2, a novel communication paradigm enabling efficient cross-GPU transmission at arbitrary bit widths. Its core innovations lie in the proposed bit splitting and spike reserving techniques, which address the challenges of low-bit quantization. Bit splitting decomposes irregular bit widths into basic units, ensuring compatibility with hardware capabilities and thus enabling transmission at any bit width. Spike reserving, on the other hand, retains numerical outliers (i.e., minima and maxima) as floating-point numbers, which shrinks the dynamic numerical range and pushes the quantization limits to 2-bit with acceptable losses. FlashCommunication V2 significantly enhances the flexibility and resource utilization of communication systems. Through meticulous software-hardware co-design, it delivers robust performance and reduced overhead across both NVLink-based and PCIe-based architectures, achieving a maximum 3.2× speedup in AllReduce and 2× in All2All communication.
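The bit-splitting idea from the abstract can be illustrated with a small sketch. The paper's actual kernels are not public in this summary, so the helpers below (`split_6bit`, `merge_6bit`) are hypothetical: they show how an irregular 6-bit payload, which has no native dtype, can be decomposed into a 4-bit plane and a 2-bit plane, each of which packs cleanly into hardware-friendly bytes.

```python
import numpy as np

def split_6bit(q):
    # Hypothetical illustration of bit splitting: decompose each 6-bit
    # value into its top 4 bits and bottom 2 bits, then pack each plane
    # into bytes (two 4-bit values per byte, four 2-bit values per byte).
    # Assumes len(q) is a multiple of 4 for simplicity.
    q = q.astype(np.uint8)
    assert q.max() < 64, "values must fit in 6 bits"
    hi = q >> 2          # 4-bit plane
    lo = q & 0b11        # 2-bit plane
    hi_packed = (hi[0::2] << 4) | hi[1::2]
    lo_packed = (lo[0::4] << 6) | (lo[1::4] << 4) | (lo[2::4] << 2) | lo[3::4]
    return hi_packed, lo_packed

def merge_6bit(hi_packed, lo_packed, n):
    # Reassemble the original 6-bit values from the two packed planes.
    hi = np.empty(n, dtype=np.uint8)
    hi[0::2] = hi_packed >> 4
    hi[1::2] = hi_packed & 0b1111
    lo = np.empty(n, dtype=np.uint8)
    lo[0::4] = lo_packed >> 6
    lo[1::4] = (lo_packed >> 4) & 0b11
    lo[2::4] = (lo_packed >> 2) & 0b11
    lo[3::4] = lo_packed & 0b11
    return (hi << 2) | lo
```

Eight 6-bit values occupy exactly 48 bits here (four bytes for the 4-bit plane plus two for the 2-bit plane), so no bandwidth is wasted on padding, which is the point of mapping irregular widths onto basic units.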
Problem

Research questions and friction points this paper is trying to address.

Addresses communication bottlenecks in distributed LLM training
Enables efficient cross-GPU transmission at arbitrary bit widths
Improves flexibility and resource utilization in communication systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bit splitting enables arbitrary bit width transmission
Spike reserving retains outliers as floating-point numbers
Software-hardware co-design boosts performance and flexibility
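The spike-reserving bullet above can be sketched in a few lines. This is a simplified stand-in for the paper's method, with hypothetical names (`spike_reserve_quantize`, `n_spikes`): the largest-magnitude entries are kept in floating point, so the remaining values span a much smaller dynamic range and a 2-bit grid covers them more finely.

```python
import numpy as np

def spike_reserve_quantize(x, bits=2, n_spikes=2):
    # Keep the n_spikes largest-magnitude values ("spikes") in float,
    # and uniformly quantize the rest over the shrunken range.
    flat = x.astype(np.float32).ravel()
    spike_idx = np.argsort(np.abs(flat))[-n_spikes:]
    spikes = flat[spike_idx].copy()
    body = np.delete(flat, spike_idx)
    lo, hi = body.min(), body.max()
    scale = (hi - lo) / (2**bits - 1) if hi > lo else 1.0
    q = np.round((body - lo) / scale).astype(np.uint8)  # e.g. values in [0, 3]
    return q, scale, lo, spike_idx, spikes

def spike_reserve_dequantize(q, scale, lo, spike_idx, spikes, shape):
    # Rebuild the tensor: dequantize the body, restore spikes exactly.
    body = q.astype(np.float32) * scale + lo
    out = np.empty(int(np.prod(shape)), dtype=np.float32)
    mask = np.ones(out.size, dtype=bool)
    mask[spike_idx] = False
    out[mask] = body
    out[spike_idx] = spikes
    return out.reshape(shape)
```

Without the spikes, a single outlier would stretch `hi - lo` and waste most of the four 2-bit levels; reserving outliers as floats is what makes 2-bit quantization tolerable.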
Qingyuan Li
Meituan
AutoML · Neural Network Compression · Hardware Acceleration · Large Language Model · AIGC
Bo Zhang
Meituan
Hui Kang
NVIDIA
Tianhao Xu
NVIDIA
Yulei Qian
Meituan
Yuchen Xie
Meituan
Lin Ma
Meituan