(Accelerated) Noise-adaptive Stochastic Heavy-Ball Momentum

📅 2024-01-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the acceleration mechanism and noise robustness of stochastic heavy ball momentum (SHB) in strongly convex quadratic optimization. Addressing the open question of whether SHB achieves accelerated convergence and how it can adapt to noise to ensure convergence to the optimum, we first establish that SHB attains an accelerated rate of $O(\exp(-T/\sqrt{\kappa}) + \sigma)$ when the mini-batch size exceeds a critical threshold. Building on this, we propose the first noise-variance-adaptive multi-stage SHB algorithm, which provably converges to the minimizer for strongly convex smooth functions. Theoretically, it achieves two convergence rates: $O(\exp(-T/\sqrt{\kappa}) + \sigma/\sqrt{T})$ for quadratics and $O(\exp(-T/\kappa) + \sigma^2/T)$ for general smooth, strongly convex functions. Extensive experiments demonstrate that our method significantly outperforms standard SHB and SGD in both convergence speed and solution accuracy under stochastic noise.

📝 Abstract
Stochastic heavy ball momentum (SHB) is commonly used to train machine learning models, and often provides empirical improvements over stochastic gradient descent. By primarily focusing on strongly-convex quadratics, we aim to better understand the theoretical advantage of SHB and subsequently improve the method. For strongly-convex quadratics, Kidambi et al. (2018) show that SHB (with a mini-batch of size $1$) cannot attain accelerated convergence, and hence has no theoretical benefit over SGD. They conjecture that the practical gain of SHB is a by-product of using larger mini-batches. We first substantiate this claim by showing that SHB can attain an accelerated rate when the mini-batch size is larger than a threshold $b^*$ that depends on the condition number $\kappa$. Specifically, we prove that with the same step-size and momentum parameters as in the deterministic setting, SHB with a sufficiently large mini-batch size results in an $O\left(\exp\left(-\frac{T}{\sqrt{\kappa}}\right) + \sigma\right)$ convergence when measuring the distance to the optimal solution in the $\ell_2$ norm, where $T$ is the number of iterations and $\sigma^2$ is the variance in the stochastic gradients. We prove a lower bound which demonstrates that a $\kappa$ dependence in $b^*$ is necessary. To ensure convergence to the minimizer, we design a noise-adaptive multi-stage algorithm that results in an $O\left(\exp\left(-\frac{T}{\sqrt{\kappa}}\right) + \frac{\sigma}{\sqrt{T}}\right)$ rate when measuring the distance to the optimal solution in the $\ell_2$ norm. We also consider the general smooth, strongly-convex setting and propose the first noise-adaptive SHB variant that converges to the minimizer at an $O\left(\exp\left(-\frac{T}{\kappa}\right) + \frac{\sigma^2}{T}\right)$ rate when measuring the distance to the optimal solution in the squared $\ell_2$ norm. We empirically demonstrate the effectiveness of the proposed algorithms.
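The setting in the abstract can be illustrated with a minimal sketch: heavy-ball iterates $x_{t+1} = x_t - \eta g_t + \beta(x_t - x_{t-1})$ on a toy strongly-convex quadratic, using the classical deterministic (Polyak) tuning of $\eta$ and $\beta$ and a mini-batch that averages the gradient noise. The quadratic, noise model, and hyperparameter values below are illustrative assumptions, not the paper's experiments; the point is only that with a large enough batch the iterates contract to an $O(\sigma)$ neighborhood of the minimizer, as the first result describes.

```python
import numpy as np

def shb(grad_fn, x0, eta, beta, T):
    """Stochastic heavy ball: x_{t+1} = x_t - eta * g_t + beta * (x_t - x_{t-1})."""
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(T):
        g = grad_fn(x)
        x, x_prev = x - eta * g + beta * (x - x_prev), x
    return x

# Toy strongly-convex quadratic f(x) = 0.5 * sum_i eigs[i] * x[i]^2, minimizer at 0.
rng = np.random.default_rng(0)
d, mu, L = 20, 1.0, 100.0
eigs = np.linspace(mu, L, d)

def stochastic_grad(x, batch=64, sigma=1.0):
    # Exact gradient plus mini-batch-averaged noise (variance sigma^2 / batch).
    noise = rng.normal(0.0, sigma, size=(batch, d)).mean(axis=0)
    return eigs * x + noise

# Deterministic heavy-ball tuning for quadratics, reused in the stochastic run.
kappa = L / mu
eta = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2
beta = ((np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)) ** 2

x = shb(stochastic_grad, np.full(d, 5.0), eta, beta, T=500)
print(np.linalg.norm(x))  # l2 distance to the minimizer: a small noise floor
```

Shrinking `batch` (or increasing `sigma`) enlarges the residual neighborhood, which is the $O(\sigma)$ term in the rate; the exponential term is the deterministic heavy-ball contraction.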
Problem

Research questions and friction points this paper is trying to address.

Understanding theoretical advantages of Stochastic Heavy-Ball Momentum (SHB)
Determining when a sufficiently large mini-batch size lets SHB attain an accelerated rate
Designing noise-adaptive SHB algorithms for optimal convergence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Accelerated SHB via mini-batch sizes above a $\kappa$-dependent threshold
Noise-adaptive multi-stage algorithm for minimizer convergence
First noise-adaptive SHB variant for general smooth, strongly-convex functions
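The multi-stage idea in the second bullet can be sketched generically: run SHB until it reaches its noise-dominated neighborhood, then shrink the step-size and repeat, so the error floor decreases stage by stage. This is a hypothetical sketch of the stage-wise step-decay pattern only; the paper's actual stage lengths, decay factor, and noise-variance adaptation are not reproduced here, and the quadratic used as a smoke test is an assumption.

```python
import numpy as np

def multistage_shb(grad_fn, x0, eta0, beta, n_stages, iters_per_stage):
    """Generic stage-wise SHB: fixed momentum, step-size halved each stage.

    Hypothetical sketch of a multi-stage schedule, not the paper's algorithm:
    each stage runs SHB long enough to reach its noise floor, then the
    step-size is halved so the next stage's floor is smaller.
    """
    x_prev, x = x0.copy(), x0.copy()
    eta = eta0
    for _ in range(n_stages):
        for _ in range(iters_per_stage):
            g = grad_fn(x)
            x, x_prev = x - eta * g + beta * (x - x_prev), x
        eta *= 0.5  # smaller step-size => smaller noise-dominated error floor
    return x

# Smoke test on a toy strongly-convex quadratic with additive gradient noise.
rng = np.random.default_rng(1)
d, mu, L = 10, 1.0, 25.0
eigs = np.linspace(mu, L, d)
grad = lambda x: eigs * x + rng.normal(0.0, 0.1, d)

eta0 = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2
beta = ((np.sqrt(L / mu) - 1.0) / (np.sqrt(L / mu) + 1.0)) ** 2
x = multistage_shb(grad, np.full(d, 3.0), eta0, beta, n_stages=5, iters_per_stage=200)
print(np.linalg.norm(x))  # much closer to the minimizer than a single fixed-step run
```

The decay factor of 0.5 and the fixed stage length are placeholder choices; a noise-adaptive scheme would instead set them from an estimate of the gradient-noise variance.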