From One-Pass SGD to Data Reuse: Mini-Batch Scaling Laws in Sketched Linear Regression

📅 2026-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing scaling-law theory does not systematically characterize the generalization error of mini-batch stochastic gradient descent (SGD) under different data reuse strategies, and in particular overlooks the role of batch size. This work addresses the gap by treating batch size as a fundamental parameter alongside compute budget, dataset size, and model dimension within a sketched linear regression framework. Assuming a power-law covariance spectrum and a source condition, the authors derive scaling laws for both one-pass and multi-pass mini-batch SGD through risk decomposition, gradient flow trajectory analysis, and stochastic process techniques. The analysis shows that in one-pass SGD the variance is jointly governed by the batch size and the effective iteration count. In multi-pass SGD, sampling without replacement yields a smaller fluctuation term than sampling with replacement whenever the batch size exceeds one; when the batch size equals the sample size, the fluctuation vanishes and the algorithm reduces to deterministic gradient descent.

📝 Abstract

Scaling laws provide compact descriptions of how prediction error varies with compute, model size, and data, but existing theory mainly treats single-sample SGD or full data reuse, leaving the role of mini-batching unclear. We study batch scaling laws for sketched linear regression under a power-law covariance spectrum and a source condition on the target parameter. We analyze one-pass batch SGD, multi-pass batch SGD with replacement, and multi-pass batch SGD without replacement. Our first result is a risk decomposition: all three procedures share the same irreducible and approximation terms, while their stochastic terms depend on the sampling protocol. One-pass batch SGD splits into bias and variance, whereas the two multi-pass methods split into GD bias, GD variance, and a fluctuation term around a common GD reference trajectory. We then prove source-condition scaling laws for one-pass and multi-pass mini-batch methods. For one-pass batch SGD, mini-batching preserves the approximation and optimization-bias exponents, while the variance scales as $O\left(\min\left(M,(T_{\mathrm{eff}}\gamma)^{1/a}\right)/(B\,T_{\mathrm{eff}})\right)$. Thus the usual $1/B$ covariance reduction holds at fixed update count $T$, but in the one-pass regime $T=N/B$ it is partly offset by the shorter optimization horizon. For multi-pass batch SGD, with- and without-replacement sampling have identical approximation and GD bias/variance terms; they differ only in the fluctuation covariance prefactor, which is $1/B$ with replacement and $\rho_{N,B}=(N-B)/(B(N-1))$ without replacement. Hence without-replacement sampling is less noisy for $B>1$, and when $B=N$ the fluctuation vanishes, recovering deterministic gradient descent. These results place batch size on the same theoretical footing as compute, data, and model dimension in sketched linear regression.
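
The two fluctuation prefactors quoted in the abstract, $1/B$ with replacement and $(N-B)/(B(N-1))$ without replacement, have the same form as the classical variance of a batch mean drawn from a finite population of size $N$. The Python sketch below is illustrative and not taken from the paper: the function names and the toy population are assumptions, and the scalar batch mean only stands in for the mini-batch gradient noise. It enumerates every possible batch exactly and compares the resulting variance with the two prefactors.

```python
from fractions import Fraction
from itertools import combinations, product


def prefactor_with_replacement(N, B):
    """Fluctuation prefactor 1/B quoted for sampling with replacement."""
    return Fraction(1, B)


def prefactor_without_replacement(N, B):
    """Prefactor (N - B) / (B (N - 1)) quoted for sampling without
    replacement. Requires N >= 2."""
    return Fraction(N - B, B * (N - 1))


def batch_mean_variance(values, B, replace):
    """Exact variance of the mean of a size-B batch drawn uniformly from
    `values`, computed by enumerating every possible batch."""
    N = len(values)
    mu = sum(values) / N
    if replace:
        batches = product(values, repeat=B)   # ordered draws, repeats allowed
    else:
        batches = combinations(values, B)     # distinct subsets of size B
    squared_devs = [(sum(b) / B - mu) ** 2 for b in batches]
    return sum(squared_devs) / len(squared_devs)


# Hypothetical toy population standing in for per-sample quantities.
population = [Fraction(v) for v in (1, 4, 6, 9, 15)]
N = len(population)
mu = sum(population) / N
sigma2 = sum((v - mu) ** 2 for v in population) / N  # population variance (1/N)

if __name__ == "__main__":
    for B in range(1, N + 1):
        with_repl = batch_mean_variance(population, B, replace=True)
        without_repl = batch_mean_variance(population, B, replace=False)
        print(B, with_repl / sigma2, without_repl / sigma2)
```

In this toy setting the printed ratios coincide with the two prefactors for every $B$: they agree at $B=1$, the without-replacement ratio is strictly smaller for $B>1$, and it reaches zero at $B=N$, which mirrors the abstract's statement that the fluctuation vanishes in the full-batch case.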

Problem

Research questions and friction points this paper is trying to address.

scaling laws

mini-batch SGD

sketched linear regression

data reuse

stochastic optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

mini-batch SGD

scaling laws

sketched linear regression