Improved Scaling Laws in Linear Regression via Data Reuse

📅 2025-06-10
🤖 AI Summary
This work addresses how scaling laws for linear regression stall in data-limited regimes. It analyzes multi-pass stochastic gradient descent (SGD), which reuses a finite sample, and shows that it can surpass the test error rate attained by single-pass training. Theoretically, the authors derive sharp test error bounds for multi-pass SGD in sketched-feature linear models, assuming power-law spectral decay of the data covariance and an aligned power-law prior on the true parameter. The data-dependent term of the test error improves from the one-pass rate $N^{(1-b)/a}$ to $L^{(1-b)/a}$, where $L > N$ (with $L \lesssim N^{a/b}$) denotes the total number of gradient updates, yielding the improved scaling law $\Theta(M^{1-b} + L^{(1-b)/a})$. Numerical simulations support the predicted faster error decay. The key contribution is showing how sample reuse improves scaling behavior, which gives a theoretical basis for training in data-constrained settings.
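
To make the rate comparison above concrete, the following is a minimal sketch that evaluates the two leading-order expressions with all constants dropped. The function names and the example values of $a$, $b$, $M$, and $N$ are illustrative assumptions, not taken from the paper; the outputs indicate orders of magnitude only.

```python
def predicted_rate(M, steps, a, b):
    """Leading-order rate M^(1-b) + steps^((1-b)/a), with constants dropped.

    `steps` is N for one-pass SGD and L for multi-pass SGD.
    """
    if not (a > b > 1):
        raise ValueError("the stated bounds assume a > b > 1")
    return M ** (1.0 - b) + steps ** ((1.0 - b) / a)


def max_covered_steps(N, a, b):
    """Largest iteration count covered by the bound, L <~ N^(a/b), up to constants."""
    return N ** (a / b)


# Hypothetical example values, chosen only so the arithmetic is easy to follow.
a, b = 2.0, 1.5
M, N = 10**6, 10**4

L = max_covered_steps(N, a, b)            # about 2.2e5 updates
one_pass = predicted_rate(M, N, a, b)     # 1e-3 + 1e-1
multi_pass = predicted_rate(M, L, a, b)   # 1e-3 + N^((1-b)/b), about 4.6e-2

print(f"one-pass rate   ~ {one_pass:.4f}")
print(f"multi-pass rate ~ {multi_pass:.4f}")
```

At the largest covered iteration count $L = N^{a/b}$, the data term becomes $N^{(1-b)/b}$, which decays faster in $N$ than the one-pass term $N^{(1-b)/a}$ because $a > b$.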

📝 Abstract
Neural scaling laws suggest that the test error of large language models trained online decreases polynomially as the model size and data size increase. However, such scaling can be unsustainable when running out of new data. In this work, we show that data reuse can improve existing scaling laws in linear regression. Specifically, we derive sharp test error bounds on $M$-dimensional linear models trained by multi-pass stochastic gradient descent (multi-pass SGD) on $N$ data with sketched features. Assuming that the data covariance has a power-law spectrum of degree $a$, and that the true parameter follows a prior with an aligned power-law spectrum of degree $b-a$ (with $a>b>1$), we show that multi-pass SGD achieves a test error of $\Theta(M^{1-b} + L^{(1-b)/a})$, where $L \lesssim N^{a/b}$ is the number of iterations. In the same setting, one-pass SGD only attains a test error of $\Theta(M^{1-b} + N^{(1-b)/a})$ (see, e.g., Lin et al., 2024). This suggests an improved scaling law via data reuse (i.e., choosing $L>N$) in data-constrained regimes. Numerical simulations are also provided to verify our theoretical findings.
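
The setting in the abstract can be sketched as a small simulation. This is one plausible reading, not the authors' experimental code. The Gaussian sketch matrix, the constant step size, the noise level, the finite ambient dimension `d`, and per-epoch shuffling are all assumptions introduced here for illustration. The prior is read as $\mathbb{E}[w_i^{*2}] = i^{a-b}$, so that $\lambda_i \, \mathbb{E}[w_i^{*2}] = i^{-b}$.

```python
import numpy as np


def simulate_excess_risk(N, L, M=64, d=512, a=2.0, b=1.5,
                         lr=0.1, noise=0.1, seed=0):
    """Run L SGD steps over N fixed samples and return the population excess risk.

    All defaults are hypothetical choices for illustration only.
    """
    rng = np.random.default_rng(seed)
    idx = np.arange(1, d + 1, dtype=float)
    lam = idx ** (-a)                                  # covariance eigenvalues i^(-a)
    # Prior with E[w_i^2] = i^(a-b), an "aligned" spectrum of degree b - a.
    w_star = rng.standard_normal(d) * idx ** ((a - b) / 2.0)
    S = rng.standard_normal((M, d)) / np.sqrt(M)       # assumed Gaussian sketch
    X = rng.standard_normal((N, d)) * np.sqrt(lam)     # rows x ~ N(0, diag(lam))
    y = X @ w_star + noise * rng.standard_normal(N)
    Z = X @ S.T                                        # sketched features, shape (N, M)

    # Visit the data in reshuffled epochs: L == N is exactly one pass,
    # L > N reuses samples (multi-pass).
    n_pass = max(1, -(-L // N))
    order = np.concatenate([rng.permutation(N) for _ in range(n_pass)])[:L]

    v = np.zeros(M)
    for i in order:
        v -= lr * (Z[i] @ v - y[i]) * Z[i]             # SGD step on the squared loss

    # Excess risk (S^T v - w*)^T H (S^T v - w*), with H = diag(lam).
    diff = S.T @ v - w_star
    return float(np.sum(lam * diff ** 2))


N = 200
one_pass = simulate_excess_risk(N, L=N)
multi_pass = simulate_excess_risk(N, L=5 * N)          # 5 passes, below N^(a/b)
print(f"one-pass excess risk:   {one_pass:.4f}")
print(f"multi-pass excess risk: {multi_pass:.4f}")
```

The stated rates are asymptotic, so a single run at this small scale may or may not show the predicted gap. Averaging over seeds and sweeping $N$ and $L$ on a log-log plot would be a more faithful check.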
Problem

Research questions and friction points this paper is trying to address.

Improving scaling laws via data reuse in linear regression
Analyzing test error bounds for multi-pass SGD on sketched features
Comparing performance of multi-pass vs one-pass SGD in data-constrained regimes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Data reuse improves linear regression scaling
Sharp test error bounds for multi-pass SGD on sketched features
Power-law assumptions on the covariance spectrum and parameter prior yield explicit error rates