🤖 AI Summary
This work addresses the lack of a unified convergence analysis framework for Byzantine-robust distributed optimization, particularly under general data heterogeneity. It establishes a comprehensive theoretical framework encompassing both momentum and non-momentum variants of distributed stochastic gradient descent (SGD). For non-convex smooth objectives and objectives satisfying the Polyak–Łojasiewicz condition, the study derives, for the first time under a general heterogeneity assumption, convergence upper bounds together with matching lower bounds, showing the analysis is tight. These bounds reveal that local momentum provably suppresses the error component induced by stochastic noise, and they characterize the fundamental performance limits of Byzantine-robust learning in the presence of both stochasticity and data heterogeneity. Empirical experiments validate the efficacy of local momentum in enhancing robustness.
📝 Abstract
Byzantine-robust distributed optimization relies on robust aggregation rules to mitigate the influence of malicious Byzantine workers. Despite the proliferation of such rules, a unified convergence analysis framework that accommodates general data heterogeneity is lacking. In this work, we provide a thorough convergence theory of Byzantine-robust distributed stochastic gradient descent (SGD), analyzing variants both with and without local momentum. We establish the convergence rates for nonconvex smooth objectives and those satisfying the Polyak–Łojasiewicz condition under a general data heterogeneity assumption. Our analysis reveals that while stochasticity and data heterogeneity introduce unavoidable error floors, local momentum provably reduces the error component induced by stochasticity. Furthermore, we derive matching lower bounds to demonstrate that the upper bounds obtained in our analysis are tight and characterize the fundamental limits of Byzantine resilience under stochasticity and data heterogeneity. Empirical results support our theoretical findings.