🤖 AI Summary
In distributed machine learning, gradient estimates are often biased by compression, clipping, meta-learning, or other system-level approximations. This leaves the convergence properties of momentum-based algorithms theoretically unclear, especially for non-convex and μ-PL (Polyak-Łojasiewicz) objectives.
Method: This paper establishes worst-case convergence bounds for parallel momentum methods under biased gradient estimates, covering both general non-convex and μ-PL settings. It treats multiple bias sources within a single generic bias model and combines tools from non-convex optimization, PL-condition theory, and distributed stochastic algorithm analysis to prove that momentum methods still converge despite bias, with explicit convergence rates.
Results: Numerical experiments support the theory and show that the analyzed momentum methods converge faster than standard biased gradient descent. The core contribution is a general convergence theory for parallel momentum under biased gradients, which connects momentum methods to the gradient biases that practical distributed systems introduce.
📝 Abstract
Parallel stochastic gradient methods are gaining prominence in solving large-scale machine learning problems that involve data distributed across multiple nodes. However, obtaining unbiased stochastic gradients, which have been the focus of most theoretical research, is challenging in many distributed machine learning applications. Gradient estimates easily become biased, for example, when gradients are compressed or clipped, when data is shuffled, and in meta-learning and reinforcement learning. In this work, we establish worst-case bounds on parallel momentum methods under biased gradient estimation on both general non-convex and $\mu$-PL problems. Our analysis covers general distributed optimization problems, and we work out the implications for special cases where gradient estimates are biased, namely in meta-learning and when the gradients are compressed or clipped. Our numerical experiments verify our theoretical findings and show that momentum methods converge faster than traditional biased gradient descent.
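To make the setting concrete, the following is a minimal sketch of one way a parallel momentum method with biased gradients could look. It is not the paper's algorithm or experimental setup. The least-squares objective, the names `clip`, `top_k`, `parallel_momentum`, and `run_demo`, the exponential-moving-average form of the momentum update, and all hyperparameter values are assumptions chosen for illustration. Each simulated worker computes a local gradient and passes it through a biased operator (norm clipping by default, top-k sparsification as an alternative); the server averages the results and takes a momentum step.

```python
import numpy as np


def clip(g, tau=1.0):
    """Rescale g so its Euclidean norm is at most tau (a biased operator)."""
    norm = np.linalg.norm(g)
    return g if norm <= tau else g * (tau / norm)


def top_k(g, k=2):
    """Keep the k largest-magnitude entries of g and zero the rest (biased)."""
    out = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]
    out[idx] = g[idx]
    return out


def loss(A_list, b_list, x):
    """Average of the local least-squares losses 0.5 * ||A_i x - b_i||^2."""
    return float(np.mean([0.5 * np.sum((A @ x - b) ** 2)
                          for A, b in zip(A_list, b_list)]))


def parallel_momentum(A_list, b_list, x0, bias_op=clip,
                      lr=0.1, beta=0.9, steps=300):
    """Simulated parallel momentum with biased worker gradients (illustrative)."""
    x = x0.copy()
    m = np.zeros_like(x)
    for _ in range(steps):
        # Each worker computes its local gradient and applies the biased operator.
        grads = [bias_op(A.T @ (A @ x - b)) for A, b in zip(A_list, b_list)]
        g = np.mean(grads, axis=0)      # server averages the worker messages
        m = beta * m + (1 - beta) * g   # momentum as a moving average (assumed form)
        x = x - lr * m
    return x


def run_demo(num_workers=4, rows=20, dim=5, seed=0):
    """Build a synthetic problem and return (initial_loss, final_loss)."""
    rng = np.random.default_rng(seed)
    x_true = 3.0 * rng.standard_normal(dim)
    A_list = [rng.standard_normal((rows, dim)) / np.sqrt(rows)
              for _ in range(num_workers)]
    b_list = [A @ x_true + 0.01 * rng.standard_normal(rows) for A in A_list]
    x0 = np.zeros(dim)
    x = parallel_momentum(A_list, b_list, x0)
    return loss(A_list, b_list, x0), loss(A_list, b_list, x)


if __name__ == "__main__":
    initial, final = run_demo()
    print(f"initial loss: {initial:.4f}, final loss: {final:.4f}")
```

Passing `bias_op=top_k` to `parallel_momentum` swaps clipping for sparsification, which gives a quick way to compare how different bias sources affect the iterates on this toy problem.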