An All-Reduce Compatible Top-K Compressor for Communication-Efficient Distributed Learning

📅 2025-10-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the gradient communication bottleneck in large-scale distributed training, this paper proposes ARC-Top-K, a Top-K compressor that is compatible with All-Reduce. Existing compressors face a trade-off: Rand-K discards structural information and performs poorly in practice, while standard Top-K keeps the most informative entries but loses the contraction property and relies on costly All-Gather operations. ARC-Top-K aligns sparsity patterns across nodes via a lightweight sketch of the gradient, eliminating index transmission and enabling index-free All-Reduce while preserving globally significant information. The compressor is provably contractive, and when integrated with the momentum error feedback mechanism EF21M it achieves linear speedup and sharper convergence rates than the original EF21M. Experiments show up to a 60.7% reduction in wall-clock training time without compromising model accuracy, demonstrating strong efficiency, scalability, and system compatibility.
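The summary above describes the mechanism only at a high level. As a rough illustration of the idea, the following is a minimal, hypothetical sketch rather than the paper's actual construction: each worker summarizes its gradient with per-block norms, the small sketches are all-reduced so that every node derives the same top-k block mask, and only the masked values are then averaged. The block-norm choice of sketch, the function names, and the single-process simulation of All-Reduce are all assumptions made here for illustration.

```python
# Hypothetical sketch of an ARC-Top-K-style compressor: workers agree on a shared
# sparsity mask via a small, all-reducible "sketch" of the gradient (here: block
# L2 norms), so the subsequent value exchange needs no index transmission.
# All-Reduce is simulated by summing across a list of per-worker arrays.
import numpy as np

def block_norm_sketch(grad: np.ndarray, block_size: int) -> np.ndarray:
    """Lightweight summary of a gradient: one L2 norm per contiguous block.
    Assumes grad.size is divisible by block_size."""
    blocks = grad.reshape(-1, block_size)
    return np.linalg.norm(blocks, axis=1)

def arc_topk_allreduce(worker_grads: list[np.ndarray],
                       block_size: int, k_blocks: int) -> np.ndarray:
    # Step 1: each worker builds its sketch; sketches are all-reduced (simulated by a sum).
    global_sketch = sum(block_norm_sketch(g, block_size) for g in worker_grads)

    # Step 2: every worker derives the SAME top-k block mask from the reduced sketch,
    # so no index lists ever need to be gathered.
    top_blocks = np.argsort(global_sketch)[-k_blocks:]
    block_mask = np.zeros_like(global_sketch, dtype=bool)
    block_mask[top_blocks] = True
    mask = np.repeat(block_mask, block_size)

    # Step 3: sum-and-average only the masked entries (the index-free All-Reduce step).
    return sum(g * mask for g in worker_grads) / len(worker_grads)

# Tiny demo with 4 simulated workers and a 32-dimensional gradient.
rng = np.random.default_rng(0)
grads = [rng.normal(size=32) for _ in range(4)]
print(arc_topk_allreduce(grads, block_size=4, k_blocks=2))
```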

📝 Abstract
Communication remains a central bottleneck in large-scale distributed machine learning, and gradient sparsification has emerged as a promising strategy to alleviate this challenge. However, existing gradient compressors face notable limitations: Rand-$K$ discards structural information and performs poorly in practice, while Top-$K$ preserves informative entries but loses the contraction property and requires costly All-Gather operations. In this paper, we propose ARC-Top-$K$, an All-Reduce-Compatible Top-$K$ compressor that aligns sparsity patterns across nodes using a lightweight sketch of the gradient, enabling index-free All-Reduce while preserving globally significant information. ARC-Top-$K$ is provably contractive and, when combined with momentum error feedback (EF21M), achieves linear speedup and sharper convergence rates than the original EF21M under standard assumptions. Empirically, ARC-Top-$K$ matches the accuracy of Top-$K$ while reducing wall-clock training time by up to 60.7%, offering an efficient and scalable solution that combines the robustness of Rand-$K$ with the strong performance of Top-$K$.
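The abstract couples ARC-Top-$K$ with momentum error feedback (EF21M). For orientation, below is a hedged sketch of the standard EF21-with-momentum recursion as it appears in the error-feedback literature, with a generic contractive compressor $\mathcal{C}$ standing in for ARC-Top-$K$; the notation (momentum parameter $\eta$, step size $\gamma$, $n$ workers) is generic, and the exact algorithm statement and constants in this paper may differ.

```latex
% Hedged sketch of a momentum error-feedback (EF21M-style) update; C is a
% contractive compressor such as ARC-Top-K. Notation is ours, not the paper's.
\begin{align*}
  v_i^{t} &= (1-\eta)\, v_i^{t-1} + \eta\, \nabla f_i(x^{t})
      && \text{(local momentum estimate on worker } i\text{)} \\
  g_i^{t} &= g_i^{t-1} + \mathcal{C}\!\left(v_i^{t} - g_i^{t-1}\right)
      && \text{(compressed correction; the part that is communicated)} \\
  x^{t+1} &= x^{t} - \gamma \cdot \frac{1}{n}\sum_{i=1}^{n} g_i^{t}
      && \text{(averaging via All-Reduce, then the model step)}
\end{align*}
```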
Problem

Research questions and friction points this paper is trying to address.

Addresses communication bottlenecks in distributed machine learning systems
Improves gradient compression by preserving globally significant information
Enables efficient All-Reduce operations while maintaining contraction properties
Innovation

Methods, ideas, or system contributions that make the work stand out.

All-Reduce compatible Top-K compressor for distributed learning
Aligns sparsity patterns using a lightweight gradient sketch
Enables index-free All-Reduce while preserving significant information
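To make the index-free point in the bullets above concrete, here is a small, hypothetical illustration (assumed for exposition, not taken from the paper): once all workers hold the same mask, each packs only the selected values into a dense buffer whose layout is identical everywhere, so a plain summing All-Reduce suffices. Per-node Top-K instead selects different indices on each worker, which is why it falls back to an All-Gather of index-value pairs. The pack/unpack helpers and the NumPy simulation of the collective are assumptions.

```python
# Hypothetical illustration of why a shared sparsity mask makes Top-K
# All-Reduce-friendly: the communicated payload is a dense length-k buffer
# with an identical layout on every worker, so no indices travel on the wire.
import numpy as np

def pack(grad: np.ndarray, mask: np.ndarray) -> np.ndarray:
    return grad[mask]            # dense length-k payload, no indices attached

def unpack(payload: np.ndarray, mask: np.ndarray, dim: int) -> np.ndarray:
    out = np.zeros(dim)
    out[mask] = payload          # scatter back using the (shared) mask
    return out

rng = np.random.default_rng(1)
dim, k = 16, 4
mask = np.zeros(dim, dtype=bool)
mask[rng.choice(dim, size=k, replace=False)] = True   # mask agreed on via the sketch

worker_grads = [rng.normal(size=dim) for _ in range(4)]
summed_payload = sum(pack(g, mask) for g in worker_grads)   # stands in for all_reduce(SUM)
averaged = unpack(summed_payload / len(worker_grads), mask, dim)
print(averaged)
```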
Authors
Chuyan Chen
Peking University, Beijing, China
Chenyang Ma
University of Oxford
Embodied AI, VLA, Agents
Zhangxin Li
Peking University, Beijing, China
Yutong He
Peking University, Beijing, China
Yanjie Dong
Associate Professor, Shenzhen MSU-BIT University
Machine learning and optimization, wireless for AI
Kun Yuan
Peking University, Beijing, P. R. China