🤖 AI Summary
This work investigates the error dynamics of randomly reshuffled mini-batch stochastic gradient descent (SGD) in least-squares regression. The central observation is that both training and generalization errors depend on a sample cross-covariance matrix $Z$ between the original feature matrix $X$ and a modified feature matrix $\widetilde{X}$, in which each feature is altered, in an averaged way, by the mini-batches that precede it during training. Building on this representation, the authors establish that the dynamics of mini-batch and full-batch gradient descent agree to leading order in the step size under the linear scaling rule. They further show that SGD with random reshuffling retains a subtle step-size dependence that continuous-time gradient-flow analysis cannot detect, converging to a limiting point that depends on the step size. Finally, using noncommutative random matrix theory and asymptotic spectral analysis, they quantitatively characterize the deviation of $Z$ from the sample covariance of $X$, proving that mini-batching induces spectral shrinkage, an intrinsic form of implicit regularization.
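To make the phenomenon concrete, below is a minimal, hypothetical sketch (not the paper's code) that runs mini-batch SGD with random reshuffling on a synthetic least-squares problem and measures the gap between its long-run iterate and the full-batch least-squares solution across step sizes; every size, seed, and function name here is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares problem; all sizes are illustrative.
n, d, b = 256, 8, 32                       # samples, features, batch size
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.5 * rng.standard_normal(n)

def sgd_random_reshuffling(eta, epochs=3000, avg_last=500):
    """Mini-batch SGD with random reshuffling on (1/2n) * ||X w - y||^2."""
    w = np.zeros(d)
    w_bar = np.zeros(d)
    for t in range(epochs):
        perm = rng.permutation(n)          # fresh shuffle each epoch
        for start in range(0, n, b):
            idx = perm[start:start + b]
            w -= (eta / b) * (X[idx].T @ (X[idx] @ w - y[idx]))
        if t >= epochs - avg_last:         # average late epoch-end iterates
            w_bar += w / avg_last          # to smooth per-epoch fluctuation
    return w_bar

w_ols = np.linalg.lstsq(X, y, rcond=None)[0]   # full-batch limit
for eta in (0.02, 0.05, 0.1):
    gap = np.linalg.norm(sgd_random_reshuffling(eta) - w_ols)
    print(f"eta = {eta:.2f}   ||w_RR - w_OLS|| = {gap:.3e}")
```

On problems like this one, the printed gap typically grows with the step size, illustrating (though not proving) the step-size-dependent limiting point that the summary describes.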
📝 Abstract
We study the discrete dynamics of mini-batch gradient descent with random reshuffling for least squares regression. We show that the training and generalization errors depend on a sample cross-covariance matrix $Z$ between the original features $X$ and a set of new features $\widetilde{X}$, in which each feature is modified, in an averaged way, by the mini-batches that appear before it during the learning process. Using this representation, we establish that the dynamics of mini-batch and full-batch gradient descent agree to leading order in the step size under the linear scaling rule. However, mini-batch gradient descent with random reshuffling exhibits a subtle dependence on the step size that a gradient flow analysis cannot detect, such as convergence to a limit that depends on the step size. By asymptotically comparing $Z$, a non-commutative polynomial of random matrices, with the sample covariance matrix of $X$, we demonstrate that batching affects the dynamics by inducing a form of shrinkage on the spectrum.
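As a point of reference, the two updates being compared can be sketched as follows, under common conventions (the $\tfrac{1}{2n}\lVert X\theta - y\rVert^2$ loss and the $1/b$ batch normalization are assumptions for illustration, not details taken from the abstract). Full-batch gradient descent iterates

$$\theta_{t+1} = \theta_t - \frac{\eta}{n}\, X^\top\!\left(X\theta_t - y\right),$$

while one epoch of mini-batch gradient descent with random reshuffling draws a fresh permutation of the $n$ samples, partitions it into batches $B_1, \dots, B_{n/b}$ of size $b$, and applies

$$\theta_{k+1} = \theta_k - \frac{\eta}{b}\, X_{B_k}^\top\!\left(X_{B_k}\theta_k - y_{B_k}\right), \qquad k = 1, \dots, n/b.$$

The linear scaling rule mentioned above scales the step size proportionally to the batch size, which is what makes the leading-order comparison between the two dynamics meaningful.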