🤖 AI Summary
To address the gradient communication bottleneck in large-scale distributed training, this paper proposes ARC-Top-K, a novel Top-K compressor compatible with All-Reduce. Existing compressors face a trade-off: Rand-K discards structural information and performs poorly in practice, while standard Top-K preserves the most informative entries but loses the contraction property and relies on high-overhead All-Gather operations. The proposed method aligns sparse patterns across nodes via a lightweight gradient sketch, eliminating index transmission entirely. It is presented as the first Top-K variant that preserves selection accuracy while restoring contraction and natively supporting All-Reduce. Combined with the EF21M momentum-based error-feedback mechanism, it attains linear speedup and sharper convergence rates than the original EF21M. Experiments demonstrate up to a 60.7% reduction in end-to-end training time without compromising model accuracy, showcasing strong efficiency, scalability, and system compatibility.
📝 Abstract
Communication remains a central bottleneck in large-scale distributed machine learning, and gradient sparsification has emerged as a promising strategy to alleviate this challenge. However, existing gradient compressors face notable limitations: Rand-$K$ discards structural information and performs poorly in practice, while Top-$K$ preserves informative entries but loses the contraction property and requires costly All-Gather operations. In this paper, we propose ARC-Top-$K$, an {All-Reduce}-Compatible Top-$K$ compressor that aligns sparsity patterns across nodes using a lightweight sketch of the gradient, enabling index-free All-Reduce while preserving globally significant information. ARC-Top-$K$ is provably contractive and, when combined with momentum error feedback (EF21M), achieves linear speedup and sharper convergence rates than the original EF21M under standard assumptions. Empirically, ARC-Top-$K$ matches the accuracy of Top-$K$ while reducing wall-clock training time by up to 60.7%, offering an efficient and scalable solution that combines the robustness of Rand-$K$ with the strong performance of Top-$K$.
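To make the core idea concrete, here is a minimal simulated sketch of the mechanism the abstract describes: every node derives the *same* Top-$K$ index set from a shared lightweight summary of the gradient, so only values (no indices) need to be exchanged, which is exactly the access pattern a plain All-Reduce supports. This is an illustrative reconstruction under assumptions, not the paper's actual algorithm; the choice of sketch (an element-wise sum of absolute gradients), the function name `arc_topk_step`, and the use of NumPy sums in place of real collectives (e.g. NCCL/MPI All-Reduce) are all stand-ins.

```python
import numpy as np

def arc_topk_step(local_grads: np.ndarray, k: int):
    """Illustrative sketch of an All-Reduce-compatible Top-K step.

    local_grads: array of shape (num_nodes, dim), simulating each
    node's local gradient. In a real system, the two np.sum calls
    below would each be an All-Reduce collective.
    """
    # Step 1: build a lightweight shared sketch of gradient magnitude
    # (here: the element-wise sum of |g_i| across nodes).
    sketch = np.sum(np.abs(local_grads), axis=0)

    # Step 2: every node computes the SAME Top-K indices from the
    # shared sketch, so no index lists need to be All-Gathered.
    idx = np.argpartition(sketch, -k)[-k:]

    # Step 3: All-Reduce only the k aligned values.
    reduced_vals = np.sum(local_grads[:, idx], axis=0)
    return idx, reduced_vals

# Toy usage with 2 nodes and a 5-dimensional gradient:
grads = np.array([[1.0, -4.0, 0.1, 3.0, 0.0],
                  [2.0,  1.0, 0.2, -3.0, 0.0]])
idx, vals = arc_topk_step(grads, k=2)
```

In this toy run, the shared sketch is `[3, 5, 0.3, 6, 0]`, so both nodes agree on coordinates 1 and 3; contrast this with per-node Top-$K$, where node 0 would pick `{1, 3}` but node 1 would pick `{0, 3}`, forcing an index exchange before any reduction.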