🤖 AI Summary
This work addresses the challenge of simultaneously achieving Byzantine robustness, communication compression, and effective small-batch training in distributed learning. To this end, we propose Byz-DM21 and its accelerated variant Byz-VR-DM21, which introduce a novel double-momentum gradient estimator that integrates error feedback with local variance reduction. Our methods enable Byzantine-resilient optimization under compressed communication without relying on large batch sizes. Theoretical analysis shows that Byz-DM21 converges to an ε-stationary point in O(ε⁻⁴) iterations, while Byz-VR-DM21 accelerates convergence to O(ε⁻³), significantly shrinking the convergence neighborhood and improving communication efficiency; these guarantees are further extended to objectives satisfying the Polyak–Łojasiewicz condition. Experimental results corroborate the efficacy of the proposed approaches.
📝 Abstract
In collaborative and distributed learning, Byzantine robustness is a major requirement for optimization algorithms. Such distributed algorithms typically involve transmitting a large number of parameters, so communication compression is essential for an effective solution. In this paper, we propose Byz-DM21, a novel Byzantine-robust and communication-efficient stochastic distributed learning algorithm. Our key innovation is a gradient estimator based on a double-momentum mechanism that integrates recent advances in error feedback techniques. Using this estimator, we design both standard and accelerated algorithms that eliminate the need for large batch sizes while maintaining robustness against Byzantine workers. We prove that the Byz-DM21 algorithm attains a smaller convergence neighborhood and converges to $\varepsilon$-stationary points in $\mathcal{O}(\varepsilon^{-4})$ iterations. To further enhance efficiency, we introduce a distributed variant called Byz-VR-DM21, which incorporates local variance reduction at each node to progressively eliminate the variance of the stochastic gradient approximations. We show that Byz-VR-DM21 provably converges to $\varepsilon$-stationary points in $\mathcal{O}(\varepsilon^{-3})$ iterations. Additionally, we extend our results to the case where the objective functions satisfy the Polyak–Łojasiewicz condition. Finally, numerical experiments demonstrate the effectiveness of the proposed methods.
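The abstract names the ingredients (momentum-based gradient estimation, error-feedback compression, robust aggregation) but does not state the update rules. As a rough, purely illustrative sketch (our own notation, not the paper's equations), a worker-side momentum estimator combined with error feedback under a compressor $\mathcal{C}$ and a robust aggregator $\mathrm{ARAgg}$ could take the form:

```latex
% Illustrative sketch only: the actual Byz-DM21 updates are not given in
% this abstract. Here \eta is a momentum parameter, \gamma a step size,
% \xi_i^t a local sample, and e_i^t the error-feedback residual.
\begin{align*}
  m_i^{t} &= (1-\eta)\, m_i^{t-1} + \eta\, \nabla f_i(x^{t}; \xi_i^{t})
      && \text{(momentum on local stochastic gradients)} \\
  \hat{m}_i^{t} &= \mathcal{C}\!\left(m_i^{t} + e_i^{t-1}\right)
      && \text{(compressed message sent to the server)} \\
  e_i^{t} &= m_i^{t} + e_i^{t-1} - \hat{m}_i^{t}
      && \text{(error-feedback residual kept locally)} \\
  x^{t+1} &= x^{t} - \gamma\, \mathrm{ARAgg}\!\left(\hat{m}_1^{t}, \dots, \hat{m}_n^{t}\right)
      && \text{(Byzantine-robust aggregation and step)}
\end{align*}
```

The momentum recursion damps per-sample noise without requiring large batches, while the residual $e_i^t$ re-injects information lost to compression on later rounds; the double-momentum and variance-reduced variants in the paper refine this basic template.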