Shuffling the Data, Stretching the Step-size: Sharper Bias in constant step-size SGD

📅 2026-04-11

🤖 AI Summary

This work addresses the inherent bias of constant step-size stochastic gradient descent (SGD) in finite-sum min-max optimization and variational inequality problems (VIPs), where the iterates converge only to a neighborhood of the solution rather than to the solution itself. The authors show theoretically that combining Random Reshuffling with Richardson–Romberg extrapolation keeps the improved mean-squared error of reshuffling while yielding a cubic refinement of the bias; to the best of their knowledge, these are the first such guarantees for structured non-monotone VIPs. The analysis first smooths the discrete noise induced by reshuffling and applies continuous-state Markov chain theory to establish a law of large numbers and a central limit theorem for the iterates. It then uses spectral tensor techniques to show that extrapolation still debiases under the biased gradient oracle induced by reshuffling. Experiments support the theory and show substantial speedups in practice.
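As a reminder of why extrapolation across step sizes can remove a leading bias term, the textbook form of Richardson–Romberg extrapolation is sketched below. The symbols $\bar{x}_\gamma$ (the long-run mean of the iterates at step size $\gamma$), $x^\star$ (the solution), and $c_1$ (an unknown constant vector) are notation assumed here for illustration; the exact expansion and weights used in the paper are not given on this page.

```latex
% Assumed first-order bias expansion at step sizes \gamma and 2\gamma:
\begin{align*}
  \bar{x}_{\gamma}  &= x^\star + c_1 \gamma + O(\gamma^2), \\
  \bar{x}_{2\gamma} &= x^\star + 2 c_1 \gamma + O(\gamma^2).
\end{align*}
% Combining the two runs with weights 2 and -1 cancels the term linear in \gamma:
\begin{equation*}
  2\,\bar{x}_{\gamma} - \bar{x}_{2\gamma} = x^\star + O(\gamma^2).
\end{equation*}
```

The weights sum to one, so the solution $x^\star$ is preserved while the $c_1 \gamma$ terms cancel exactly.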

📝 Abstract

From adversarial robustness to multi-agent learning, many machine learning tasks can be cast as finite-sum min-max optimization or, more generally, as variational inequality problems (VIPs). Owing to their simplicity and scalability, stochastic gradient methods with constant step size are widely used, despite the fact that they converge only up to a constant term. Among the many heuristics adopted in practice, two classical techniques have recently attracted attention to mitigate this issue: Random Reshuffling of data and Richardson–Romberg extrapolation across iterates. Random Reshuffling sharpens the mean-squared error (MSE) of the estimated solution, while Richardson–Romberg extrapolation acts orthogonally, providing a second-order reduction in its bias. In this work, we show that their composition is strictly better than both, not only maintaining the enhanced MSE guarantees but also yielding an even greater cubic refinement in the bias. To the best of our knowledge, our work provides the first theoretical guarantees for such a synergy in structured non-monotone VIPs. Our analysis proceeds in two steps: (i) we smooth the discrete noise induced by reshuffling and leverage tools from continuous-state Markov chain theory to establish a novel law of large numbers and a central limit theorem for its iterates; and (ii) we employ spectral tensor techniques to prove that extrapolation debiases and sharpens the asymptotic behavior even under the biased gradient oracle induced by reshuffling. Finally, extensive experiments validate our theory, consistently demonstrating substantial speedups in practice.
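The two techniques named in the abstract can be illustrated with a minimal sketch on a toy problem. Everything below is an assumption made for illustration and not the authors' algorithm or experimental setup: the one-dimensional finite sum of quadratics, the function names `reshuffled_sgd_average` and `richardson_romberg`, the choice to average end-of-epoch iterates after a burn-in, and the classical extrapolation weights 2 and -1 (which cancel a bias term linear in the step size).

```python
import random


def reshuffled_sgd_average(a, b, step, epochs, seed, burn_in=0):
    """Constant step-size SGD with Random Reshuffling on the toy objective
    f(x) = (1/n) * sum_i 0.5 * a[i] * (x - b[i])**2.

    Each epoch visits every component exactly once in a fresh random order.
    Returns the average of the end-of-epoch iterates recorded after
    `burn_in` epochs (requires burn_in < epochs).
    """
    rng = random.Random(seed)
    n = len(a)
    x = 0.0
    total, count = 0.0, 0
    for epoch in range(epochs):
        order = list(range(n))
        rng.shuffle(order)  # sample a new permutation without replacement
        for i in order:
            x -= step * a[i] * (x - b[i])  # gradient step on component i
        if epoch >= burn_in:
            total += x
            count += 1
    return total / count


def richardson_romberg(avg_small_step, avg_large_step):
    """Classical extrapolation from runs at step sizes gamma and 2*gamma."""
    return 2.0 * avg_small_step - avg_large_step


if __name__ == "__main__":
    # Hypothetical toy data; the minimizer is sum(a_i * b_i) / sum(a_i).
    a = [0.5, 1.0, 1.5, 2.0]
    b = [-1.0, 0.0, 2.0, 3.0]
    x_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(a)

    gamma = 0.05
    small = reshuffled_sgd_average(a, b, gamma, 20000, seed=0, burn_in=100)
    large = reshuffled_sgd_average(a, b, 2 * gamma, 20000, seed=1, burn_in=100)
    combined = richardson_romberg(small, large)

    print("error at step gamma:    ", abs(small - x_star))
    print("error at step 2*gamma:  ", abs(large - x_star))
    print("error after extrapolation:", abs(combined - x_star))
```

Running the script prints the three errors side by side for comparison. The two runs share no state, so they can be executed in parallel; the extrapolated estimate costs one extra run at the larger step size.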

Problem

Research questions and friction points this paper is trying to address.

stochastic gradient descent

bias reduction

variational inequality problems

constant step-size

finite-sum min-max optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Random Reshuffling

Richardson–Romberg extrapolation

constant step-size SGD