🤖 AI Summary
Existing scaling-law theory lacks a systematic characterization of the generalization error of mini-batch stochastic gradient descent (SGD) under different data-reuse strategies, and in particular overlooks the role of batch size. This work addresses the gap by treating batch size as a fundamental parameter alongside compute budget, dataset size, and model dimension within a sketching-based linear regression framework. Assuming a power-law covariance spectrum and a source condition, the authors derive scaling laws for both one-pass and multi-pass mini-batch SGD through risk decomposition, gradient flow trajectory analysis, and stochastic process techniques. The analysis shows that in one-pass SGD the variance is jointly governed by batch size and effective iteration count, while in multi-pass SGD sampling without replacement yields a smaller fluctuation term than sampling with replacement whenever the batch size exceeds one; when the batch size equals the sample size, the fluctuation vanishes and the algorithm reduces to deterministic gradient descent.
📝 Abstract
Scaling laws provide compact descriptions of how prediction error varies with compute, model size, and data, but existing theory mainly treats single-sample SGD or full data reuse, leaving the role of mini-batching unclear. We study batch scaling laws for sketched linear regression under a power-law covariance spectrum and a source condition on the target parameter. We analyze one-pass batch SGD, multi-pass batch SGD with replacement, and multi-pass batch SGD without replacement. Our first result is a risk decomposition: all three procedures share the same irreducible and approximation terms, while their stochastic terms depend on the sampling protocol. One-pass batch SGD splits into bias and variance, whereas the two multi-pass methods split into GD bias, GD variance, and a fluctuation term around a common GD reference trajectory. We then prove source-condition scaling laws for one-pass and multi-pass mini-batch methods. For one-pass batch SGD, mini-batching preserves the approximation and optimization-bias exponents, while the variance scales as $O(\min(M,(T_{\mathrm{eff}}\gamma)^{1/a})/(B T_{\mathrm{eff}}))$. Thus the usual $1/B$ covariance reduction holds at fixed update count $T$, but in the one-pass regime $T=N/B$ it is partly offset by the shorter optimization horizon. For multi-pass batch SGD, with- and without-replacement sampling have identical approximation and GD bias/variance terms; they differ only in the fluctuation covariance prefactor, which is $1/B$ with replacement and $\rho_{N,B}=(N-B)/(B(N-1))$ without replacement. Hence without-replacement sampling is less noisy for $B>1$, and when $B=N$ the fluctuation vanishes, recovering deterministic gradient descent. These results place batch size on the same theoretical footing as compute, data, and model dimension in sketched linear regression.
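The two fluctuation prefactors quoted above can be compared numerically. The following is a minimal sketch, not the paper's analysis: the function names are hypothetical, and the toy check uses a scalar population rather than sketched regression gradients. It relies only on the classical fact that the mean of a size-$B$ batch drawn without replacement from $N$ values has variance $\rho_{N,B}$ times the population variance.

```python
from fractions import Fraction
from itertools import combinations


def with_replacement_prefactor(B: int) -> Fraction:
    """Fluctuation prefactor 1/B quoted for sampling with replacement."""
    if B < 1:
        raise ValueError("B must be at least 1")
    return Fraction(1, B)


def without_replacement_prefactor(N: int, B: int) -> Fraction:
    """rho_{N,B} = (N - B) / (B (N - 1)), as quoted in the abstract."""
    if N < 2 or not 1 <= B <= N:
        raise ValueError("need N >= 2 and 1 <= B <= N")
    return Fraction(N - B, B * (N - 1))


def population_variance(population) -> Fraction:
    """Variance of the population with the 1/N normalizer."""
    values = [Fraction(v) for v in population]
    mu = sum(values) / len(values)
    return sum((v - mu) ** 2 for v in values) / len(values)


def batch_mean_variance(population, B: int) -> Fraction:
    """Exact variance of the batch mean over all size-B subsets.

    Toy scalar stand-in (an assumption of this sketch) for the
    mini-batch gradient noise discussed in the abstract.
    """
    values = [Fraction(v) for v in population]
    means = [sum(batch) / B for batch in combinations(values, B)]
    center = sum(means) / len(means)
    return sum((m - center) ** 2 for m in means) / len(means)


if __name__ == "__main__":
    N = 10
    for B in (1, 2, 5, 10):
        print(B, with_replacement_prefactor(B), without_replacement_prefactor(N, B))
```

At $B=1$ the two prefactors coincide, for $1<B<N$ the without-replacement value is strictly smaller, and at $B=N$ it is exactly zero, matching the statement that the fluctuation vanishes and deterministic gradient descent is recovered. Exact rational arithmetic is used so the comparison is not blurred by floating-point rounding.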