An All-Reduce Compatible Top-K Compressor for Communication-Efficient Distributed Learning

📅 2025-10-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the gradient communication bottleneck in large-scale distributed training, this paper proposes ARC-Top-K, a Top-K compressor that is compatible with All-Reduce. Existing compressors face a trade-off: Rand-K discards structural information and performs poorly in practice, while standard Top-K keeps the most informative entries but loses the contraction property and relies on costly All-Gather operations. ARC-Top-K aligns sparsity patterns across nodes via a lightweight sketch of the gradient, eliminating index transmission and enabling index-free All-Reduce while preserving globally significant information. The compressor is provably contractive, and when integrated with the momentum error feedback mechanism EF21M it achieves linear speedup and sharper convergence rates than the original EF21M. Experiments show up to a 60.7% reduction in wall-clock training time without compromising model accuracy, demonstrating strong efficiency, scalability, and system compatibility.
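The summary above describes the mechanism only at a high level. As a rough illustration of the idea, the following is a minimal, hypothetical sketch rather than the paper's actual construction: each worker summarizes its gradient with per-block norms, the small sketches are all-reduced so that every node derives the same top-k block mask, and only the masked values are then averaged. The block-norm choice of sketch, the function names, and the single-process simulation of All-Reduce are all assumptions made here for illustration.

```python
# Hypothetical sketch of an ARC-Top-K-style compressor: workers agree on a shared
# sparsity mask via a small, all-reducible "sketch" of the gradient (here: block
# L2 norms), so the subsequent value exchange needs no index transmission.
# All-Reduce is simulated by summing across a list of per-worker arrays.
import numpy as np

def block_norm_sketch(grad: np.ndarray, block_size: int) -> np.ndarray:
    """Lightweight summary of a gradient: one L2 norm per contiguous block.
    Assumes grad.size is divisible by block_size."""
    blocks = grad.reshape(-1, block_size)
    return np.linalg.norm(blocks, axis=1)

def arc_topk_allreduce(worker_grads: list[np.ndarray],
                       block_size: int, k_blocks: int) -> np.ndarray:
    # Step 1: each worker builds its sketch; sketches are all-reduced (simulated by a sum).
    global_sketch = sum(block_norm_sketch(g, block_size) for g in worker_grads)

    # Step 2: every worker derives the SAME top-k block mask from the reduced sketch,
    # so no index lists ever need to be gathered.
    top_blocks = np.argsort(global_sketch)[-k_blocks:]
    block_mask = np.zeros_like(global_sketch, dtype=bool)
    block_mask[top_blocks] = True
    mask = np.repeat(block_mask, block_size)

    # Step 3: sum-and-average only the masked entries (the index-free All-Reduce step).
    return sum(g * mask for g in worker_grads) / len(worker_grads)

# Tiny demo with 4 simulated workers and a 32-dimensional gradient.
rng = np.random.default_rng(0)
grads = [rng.normal(size=32) for _ in range(4)]
print(arc_topk_allreduce(grads, block_size=4, k_blocks=2))
```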

📝 Abstract
Communication remains a central bottleneck in large-scale distributed machine learning, and gradient sparsification has emerged as a promising strategy to alleviate this challenge. However, existing gradient compressors face notable limitations: Rand-$K$ discards structural information and performs poorly in practice, while Top-$K$ preserves informative entries but loses the contraction property and requires costly All-Gather operations. In this paper, we propose ARC-Top-$K$, an All-Reduce-Compatible Top-$K$ compressor that aligns sparsity patterns across nodes using a lightweight sketch of the gradient, enabling index-free All-Reduce while preserving globally significant information. ARC-Top-$K$ is provably contractive and, when combined with momentum error feedback (EF21M), achieves linear speedup and sharper convergence rates than the original EF21M under standard assumptions. Empirically, ARC-Top-$K$ matches the accuracy of Top-$K$ while reducing wall-clock training time by up to 60.7%, offering an efficient and scalable solution that combines the robustness of Rand-$K$ with the strong performance of Top-$K$.
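The abstract couples ARC-Top-$K$ with momentum error feedback (EF21M). For orientation, below is a hedged sketch of the standard EF21-with-momentum recursion as it appears in the error-feedback literature, with a generic contractive compressor $\mathcal{C}$ standing in for ARC-Top-$K$; the notation (momentum parameter $\eta$, step size $\gamma$, $n$ workers) is generic, and the exact algorithm statement and constants in this paper may differ.

```latex
% Hedged sketch of a momentum error-feedback (EF21M-style) update; C is a
% contractive compressor such as ARC-Top-K. Notation is ours, not the paper's.
\begin{align*}
  v_i^{t} &= (1-\eta)\, v_i^{t-1} + \eta\, \nabla f_i(x^{t})
      && \text{(local momentum estimate on worker } i\text{)} \\
  g_i^{t} &= g_i^{t-1} + \mathcal{C}\!\left(v_i^{t} - g_i^{t-1}\right)
      && \text{(compressed correction; the part that is communicated)} \\
  x^{t+1} &= x^{t} - \gamma \cdot \frac{1}{n}\sum_{i=1}^{n} g_i^{t}
      && \text{(averaging via All-Reduce, then the model step)}
\end{align*}
```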
Problem

Research questions and friction points this paper is trying to address.

Addresses communication bottlenecks in distributed machine learning systems
Improves gradient compression by preserving globally significant information
Enables efficient All-Reduce operations while maintaining contraction properties
Innovation

Methods, ideas, or system contributions that make the work stand out.

All-Reduce compatible Top-K compressor for distributed learning
Aligns sparsity patterns using a lightweight gradient sketch
Enables index-free All-Reduce while preserving significant information
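To make the index-free point in the bullets above concrete, here is a small, hypothetical illustration (assumed for exposition, not taken from the paper): once all workers hold the same mask, each packs only the selected values into a dense buffer whose layout is identical everywhere, so a plain summing All-Reduce suffices. Per-node Top-K instead selects different indices on each worker, which is why it falls back to an All-Gather of index-value pairs. The pack/unpack helpers and the NumPy simulation of the collective are assumptions.

```python
# Hypothetical illustration of why a shared sparsity mask makes Top-K
# All-Reduce-friendly: the communicated payload is a dense length-k buffer
# with an identical layout on every worker, so no indices travel on the wire.
import numpy as np

def pack(grad: np.ndarray, mask: np.ndarray) -> np.ndarray:
    return grad[mask]            # dense length-k payload, no indices attached

def unpack(payload: np.ndarray, mask: np.ndarray, dim: int) -> np.ndarray:
    out = np.zeros(dim)
    out[mask] = payload          # scatter back using the (shared) mask
    return out

rng = np.random.default_rng(1)
dim, k = 16, 4
mask = np.zeros(dim, dtype=bool)
mask[rng.choice(dim, size=k, replace=False)] = True   # mask agreed on via the sketch

worker_grads = [rng.normal(size=dim) for _ in range(4)]
summed_payload = sum(pack(g, mask) for g in worker_grads)   # stands in for all_reduce(SUM)
averaged = unpack(summed_payload / len(worker_grads), mask, dim)
print(averaged)
```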
Authors
Chuyan Chen
Peking University, Beijing, China
Chenyang Ma
University of Oxford
Embodied AI, VLA, Agents
Zhangxin Li
Peking University, Beijing, China
Yutong He
Peking University, Beijing, China
Yanjie Dong
Associate Professor, Shenzhen MSU-BIT University
Machine learning and optimization, wireless for AI
Kun Yuan
Peking University, Beijing, P. R. China