🤖 AI Summary
This work addresses the challenge of simultaneously achieving Byzantine robustness, communication compression, and effective small-batch training in distributed learning. To this end, we propose Byz-DM21 and its accelerated variant Byz-VR-DM21, which introduce a novel double-momentum gradient estimator that integrates error feedback with local variance reduction. Our methods enable Byzantine-resilient optimization under compressed communication without relying on large batch sizes. Theoretical analysis shows that Byz-DM21 converges to an ε-stationary point in O(ε⁻⁴) iterations, while Byz-VR-DM21 accelerates convergence to O(ε⁻³), significantly shrinking the convergence neighborhood and improving communication efficiency; these guarantees are further extended to objectives satisfying the Polyak–Łojasiewicz condition. Experimental results corroborate the efficacy of the proposed approaches.
📝 Abstract
In collaborative and distributed learning, Byzantine robustness is a major requirement for optimization algorithms. Such distributed algorithms typically involve transmitting a large number of parameters, so communication compression is essential for an effective solution. In this paper, we propose Byz-DM21, a novel Byzantine-robust and communication-efficient stochastic distributed learning algorithm. Our key innovation is a gradient estimator based on a double-momentum mechanism that integrates recent advances in error feedback techniques. Using this estimator, we design both standard and accelerated algorithms that eliminate the need for large batch sizes while maintaining robustness against Byzantine workers. We prove that the Byz-DM21 algorithm attains a smaller convergence neighborhood and converges to $\varepsilon$-stationary points in $\mathcal{O}(\varepsilon^{-4})$ iterations. To further enhance efficiency, we introduce a distributed variant called Byz-VR-DM21, which incorporates local variance reduction at each node to progressively eliminate the variance of the stochastic gradient approximations. We show that Byz-VR-DM21 provably converges to $\varepsilon$-stationary points in $\mathcal{O}(\varepsilon^{-3})$ iterations. Additionally, we extend our results to the case where the objective functions satisfy the Polyak–Łojasiewicz condition. Finally, numerical experiments demonstrate the effectiveness of the proposed methods.
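The abstract names the ingredients (momentum-based gradient estimation, error-feedback compression, robust aggregation) but does not state the update rules. As a rough, purely illustrative sketch (our own notation, not the paper's equations), a worker-side momentum estimator combined with error feedback under a compressor $\mathcal{C}$ and a robust aggregator $\mathrm{ARAgg}$ could take the form:

```latex
% Illustrative sketch only: the actual Byz-DM21 updates are not given in
% this abstract. Here \eta is a momentum parameter, \gamma a step size,
% \xi_i^t a local sample, and e_i^t the error-feedback residual.
\begin{align*}
  m_i^{t} &= (1-\eta)\, m_i^{t-1} + \eta\, \nabla f_i(x^{t}; \xi_i^{t})
      && \text{(momentum on local stochastic gradients)} \\
  \hat{m}_i^{t} &= \mathcal{C}\!\left(m_i^{t} + e_i^{t-1}\right)
      && \text{(compressed message sent to the server)} \\
  e_i^{t} &= m_i^{t} + e_i^{t-1} - \hat{m}_i^{t}
      && \text{(error-feedback residual kept locally)} \\
  x^{t+1} &= x^{t} - \gamma\, \mathrm{ARAgg}\!\left(\hat{m}_1^{t}, \dots, \hat{m}_n^{t}\right)
      && \text{(Byzantine-robust aggregation and step)}
\end{align*}
```

The momentum recursion damps per-sample noise without requiring large batches, while the residual $e_i^t$ re-injects information lost to compression on later rounds; the double-momentum and variance-reduced variants in the paper refine this basic template.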