🤖 AI Summary
This work addresses the data bottleneck that limits scaling laws for linear regression once fresh data run out. We analyze multi-pass stochastic gradient descent (SGD), which reuses a finite sample, and show that it can improve on the test error attained by one-pass training. Theoretically, we derive sharp test error bounds for multi-pass SGD in sketched-feature linear models, assuming a power-law spectrum for the data covariance and a prior on the true parameter with an aligned power-law spectrum. We prove that the data-dependent term of the test error improves from $N^{(1-b)/a}$ for one-pass SGD to $L^{(1-b)/a}$, where $L > N$ denotes the total number of gradient updates (with $L \lesssim N^{a/b}$), yielding the scaling law $\Theta(M^{1-b} + L^{(1-b)/a})$. Numerical simulations agree with the predicted faster error decay. The key contribution is a rigorous account of how sample reuse improves scaling behavior in data-constrained settings.
📝 Abstract
Neural scaling laws suggest that the test error of large language models trained online decreases polynomially as the model size and data size increase. However, such scaling becomes unsustainable once new data run out. In this work, we show that data reuse can improve existing scaling laws in linear regression. Specifically, we derive sharp test error bounds on $M$-dimensional linear models trained by multi-pass stochastic gradient descent (multi-pass SGD) on $N$ data points with sketched features. Assuming that the data covariance has a power-law spectrum of degree $a$, and that the true parameter follows a prior with an aligned power-law spectrum of degree $b-a$ (with $a>b>1$), we show that multi-pass SGD achieves a test error of $\Theta(M^{1-b} + L^{(1-b)/a})$, where $L \lesssim N^{a/b}$ is the number of iterations. In the same setting, one-pass SGD only attains a test error of $\Theta(M^{1-b} + N^{(1-b)/a})$ (see, e.g., Lin et al., 2024). This suggests an improved scaling law via data reuse (i.e., choosing $L>N$) in data-constrained regimes. Numerical simulations are also provided to verify our theoretical findings.
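
The setting described above can be illustrated with a small simulation. The following is a minimal sketch under assumptions that are not taken from the paper: the ambient dimension `d` is a finite truncation, the sketch matrix `S` is Gaussian, the prior is realized as independent Gaussian coordinates with $\lambda_i \, \mathbb{E}[w_i^{*2}] = i^{-b}$, the step size is constant, and multi-pass SGD samples indices with replacement. All function names (`make_problem`, `excess_risk`, `sgd`) are hypothetical.

```python
import numpy as np


def make_problem(d, M, N, a, b, noise_std, rng):
    """Draw a sketched linear regression instance (illustrative assumptions)."""
    idx = np.arange(1, d + 1, dtype=float)
    lam = idx ** (-a)  # power-law covariance spectrum of degree a
    # Assumed prior: variance i^(a-b), so that lam_i * E[w_i^2] = i^(-b).
    w_star = rng.standard_normal(d) * idx ** ((a - b) / 2)
    S = rng.standard_normal((M, d)) / np.sqrt(M)  # assumed Gaussian sketch
    X = rng.standard_normal((N, d)) * np.sqrt(lam)  # rows x ~ N(0, diag(lam))
    y = X @ w_star + noise_std * rng.standard_normal(N)
    Z = X @ S.T  # sketched features, shape (N, M)
    return lam, w_star, S, Z, y


def excess_risk(v, S, w_star, lam):
    """Population excess risk E[(<v, Sx> - <w_star, x>)^2] for x ~ N(0, diag(lam))."""
    diff = S.T @ v - w_star
    return float(np.sum(lam * diff ** 2))


def sgd(Z, y, order, lr):
    """Constant-step-size SGD from zero, visiting samples in the given order."""
    v = np.zeros(Z.shape[1])
    for i in order:
        v -= lr * (Z[i] @ v - y[i]) * Z[i]
    return v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, L = 500, 5000  # L > N means each sample is reused on average
    lam, w_star, S, Z, y = make_problem(
        d=2000, M=200, N=N, a=2.0, b=1.5, noise_std=0.1, rng=rng
    )
    v_one = sgd(Z, y, rng.permutation(N), lr=0.1)  # one pass: N updates
    v_multi = sgd(Z, y, rng.integers(0, N, size=L), lr=0.1)  # L updates with reuse
    print("one-pass excess risk:  ", excess_risk(v_one, S, w_star, lam))
    print("multi-pass excess risk:", excess_risk(v_multi, S, w_star, lam))
```

With $a = 2$ and $b = 1.5$, the stated rates give a data term of order $N^{-1/4}$ for one-pass SGD and, at the largest admissible $L \asymp N^{a/b}$, of order $N^{(1-b)/b} = N^{-1/3}$ for multi-pass SGD. A single run at these sizes is noisy and is not a verification of the bounds; averaging over seeds and sweeping $N$ and $L$ would be needed to see the slopes.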