Improved Scaling Laws in Linear Regression via Data Reuse

📅 2025-06-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a bottleneck of scaling laws for linear regression in data-limited regimes, where scaling by adding fresh data is no longer possible. We analyze multi-pass stochastic gradient descent (SGD), which reuses a finite sample of $N$ data points, and show that it improves on the test error rate attained by one-pass SGD. Theoretically, we derive sharp test error bounds for multi-pass SGD on $M$-dimensional sketched-feature linear models, assuming a power-law covariance spectrum of degree $a$ and a parameter prior with an aligned power-law spectrum of degree $b-a$ (with $a>b>1$). The data-dependent term of the test error improves from $N^{(1-b)/a}$ for one-pass SGD to $L^{(1-b)/a}$, where $L$ is the number of iterations and may exceed $N$ as long as $L \lesssim N^{a/b}$, yielding the scaling law $\Theta(M^{1-b} + L^{(1-b)/a})$. Numerical simulations support the predicted faster error decay. The key contribution is a rigorous account of how sample reuse improves scaling behavior in data-constrained settings.
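To make the two rates concrete, the following minimal sketch evaluates them numerically. It assumes the hidden constants in $\Theta(\cdot)$ and $\lesssim$ are all 1, which the source does not state; the function names are hypothetical and chosen only for illustration.

```python
def one_pass_rate(M: float, N: float, a: float, b: float) -> float:
    """Rate M^(1-b) + N^((1-b)/a) for one-pass SGD, with constants set to 1 (assumption)."""
    if not a > b > 1:
        raise ValueError("the stated setting requires a > b > 1")
    return M ** (1 - b) + N ** ((1 - b) / a)


def multi_pass_rate(M: float, L: float, a: float, b: float) -> float:
    """Rate M^(1-b) + L^((1-b)/a) for multi-pass SGD, with constants set to 1 (assumption)."""
    if not a > b > 1:
        raise ValueError("the stated setting requires a > b > 1")
    return M ** (1 - b) + L ** ((1 - b) / a)


def within_reuse_budget(N: float, L: float, a: float, b: float) -> bool:
    """Rough check of L <= N^(a/b); the source's condition hides an unknown constant."""
    return L <= N ** (a / b)


if __name__ == "__main__":
    a, b = 2.0, 1.5          # illustrative exponents satisfying a > b > 1
    M, N, L = 10**6, 10**4, 10**5
    print("reuse budget respected:", within_reuse_budget(N, L, a, b))
    print("one-pass rate:  ", one_pass_rate(M, N, a, b))
    print("multi-pass rate:", multi_pass_rate(M, L, a, b))
```

Because $(1-b)/a < 0$, the second term shrinks as $L$ grows, so any $L > N$ within the budget gives a smaller value than the one-pass rate for the same $M$ and $N$.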

📝 Abstract

Neural scaling laws suggest that the test error of large language models trained online decreases polynomially as the model size and data size increase. However, such scaling can be unsustainable when running out of new data. In this work, we show that data reuse can improve existing scaling laws in linear regression. Specifically, we derive sharp test error bounds on $M$-dimensional linear models trained by multi-pass stochastic gradient descent (multi-pass SGD) on $N$ data with sketched features. Assuming that the data covariance has a power-law spectrum of degree $a$, and that the true parameter follows a prior with an aligned power-law spectrum of degree $b-a$ (with $a>b>1$), we show that multi-pass SGD achieves a test error of $\Theta(M^{1-b} + L^{(1-b)/a})$, where $L \lesssim N^{a/b}$ is the number of iterations. In the same setting, one-pass SGD only attains a test error of $\Theta(M^{1-b} + N^{(1-b)/a})$ (see, e.g., Lin et al., 2024). This suggests an improved scaling law via data reuse (i.e., choosing $L>N$) in data-constrained regimes. Numerical simulations are also provided to verify our theoretical findings.
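The setting in the abstract can be explored with a small simulation. The sketch below is not the authors' experimental code: it assumes Gaussian data truncated to a finite dimension `d`, a Gaussian sketch matrix, a constant step size, a prior with $\mathbb{E}[w_i^2] = i^{a-b}$ (one reading of "degree $b-a$"), and uniform sampling with replacement whenever $L > N$. All names and default values are hypothetical.

```python
import numpy as np


def simulate_sgd(N, L, M=32, d=256, a=2.0, b=1.5,
                 step=0.05, noise_std=0.1, seed=0):
    """Run SGD on M sketched features and return the excess test risk.

    If L <= N, each sample is used at most once (one-pass).
    If L > N, indices are drawn uniformly with replacement (data reuse).
    """
    rng = np.random.default_rng(seed)
    idx = np.arange(1, d + 1, dtype=float)
    eigs = idx ** (-a)                        # covariance spectrum: lambda_i = i^(-a)
    # Assumed prior: E[w_i^2] = i^(a-b), so lambda_i * E[w_i^2] = i^(-b).
    w_star = rng.standard_normal(d) * idx ** ((a - b) / 2)

    S = rng.standard_normal((M, d)) / np.sqrt(M)      # Gaussian sketch (assumption)
    X = rng.standard_normal((N, d)) * np.sqrt(eigs)   # rows x ~ N(0, diag(eigs))
    y = X @ w_star + noise_std * rng.standard_normal(N)
    Z = X @ S.T                                       # sketched features, shape (N, M)

    order = np.arange(L) if L <= N else rng.integers(0, N, size=L)
    v = np.zeros(M)
    for i in order:
        residual = Z[i] @ v - y[i]
        v -= step * residual * Z[i]           # SGD step on the squared loss

    # Excess risk of the predictor x -> <Sx, v>: (S^T v - w*)^T H (S^T v - w*).
    err = S.T @ v - w_star
    return float(err @ (eigs * err))


if __name__ == "__main__":
    N = 200
    print("one-pass   (L = N): ", simulate_sgd(N=N, L=N))
    print("multi-pass (L = 5N):", simulate_sgd(N=N, L=5 * N))
```

A single run is noisy, so comparing against the predicted exponents would require averaging over many seeds and plotting the risk against $L$ on log-log axes.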

Problem

Research questions and friction points this paper is trying to address.

Improving scaling laws via data reuse in linear regression

Analyzing test error bounds for multi-pass SGD on sketched features

Comparing performance of multi-pass vs one-pass SGD in data-constrained regimes

Innovation

Methods, ideas, or system contributions that make the work stand out.

Data reuse improves linear regression scaling

Multi-pass SGD sharpens test error bounds

Power-law spectral assumptions on the data covariance and parameter prior yield explicit error rates

🔎 Similar Papers

Transfer Learning in ℓ1 Regularized Regression: Hyperparameter Selection Strategy based on Sharp Asymptotic Analysis