Scaling Laws in Linear Regression: Compute, Parameters, and Data

📅 2024-06-12
🏛️ Neural Information Processing Systems
📈 Citations: 16
Influential: 2
🤖 AI Summary
This work resolves a theoretical tension between neural scaling laws—which posit monotonic test error reduction with model size—and the classical bias-variance decomposition—which predicts increasing variance with parameter count. The authors analyze infinite-dimensional linear regression under a power-law spectral covariance structure and Gaussian prior, incorporating single-pass stochastic gradient descent (SGD) and its implicit regularization. For the first time, this framework theoretically reproduces neural scaling laws. They rigorously derive a reducible error bound of Θ(M^{−(a−1)} + N^{−(a−1)/a}), where M is model size and N is sample size, showing that implicit regularization dominates and suppresses the variance term—preventing its growth with model scale. Numerical experiments confirm the predicted power-law error decay. This provides a unified theoretical explanation for the persistent performance gains observed in increasingly large models.

📝 Abstract
Empirically, large-scale deep learning models often satisfy a neural scaling law: the test error of the trained model improves polynomially as the model size and data size grow. However, conventional wisdom suggests the test error consists of approximation, bias, and variance errors, where the variance error increases with model size. This disagrees with the general form of neural scaling laws, which predict that increasing model size monotonically improves performance. We study the theory of scaling laws in an infinite dimensional linear regression setup. Specifically, we consider a model with $M$ parameters as a linear function of sketched covariates. The model is trained by one-pass stochastic gradient descent (SGD) using $N$ data. Assuming the optimal parameter satisfies a Gaussian prior and the data covariance matrix has a power-law spectrum of degree $a>1$, we show that the reducible part of the test error is $\Theta(M^{-(a-1)} + N^{-(a-1)/a})$. The variance error, which increases with $M$, is dominated by the other errors due to the implicit regularization of SGD, thus disappearing from the bound. Our theory is consistent with the empirical neural scaling laws and verified by numerical simulation.
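The closed-form rate in the abstract is easy to probe numerically. The sketch below evaluates the two terms of Θ(M^{−(a−1)} + N^{−(a−1)/a}) with all problem-dependent constants set to 1, an illustrative assumption; the function name `reducible_error` is hypothetical, not from the paper.

```python
# Sketch: evaluate the two terms of the reducible-error bound
# Theta(M^{-(a-1)} + N^{-(a-1)/a}), with all constants set to 1
# (an assumption; the true bound carries problem-dependent constants).

def reducible_error(M: int, N: int, a: float = 2.0) -> float:
    """Approximation term (model size M) plus statistical term (sample size N)."""
    assert a > 1.0, "power-law spectral degree must exceed 1"
    return M ** (-(a - 1)) + N ** (-(a - 1) / a)

# For a = 2, the M-term decays as 1/M while the N-term decays as 1/sqrt(N),
# so growing the model only helps until the data term dominates.
for M in (10, 100, 1000):
    print(f"M={M:5d}  bound ~ {reducible_error(M, N=10_000):.4f}")
```

Balancing the two terms gives M ≈ N^{1/a}; for a = 2 and N = 10,000 the bound stops improving once M exceeds roughly √N = 100, which reflects the compute/parameter/data trade-off in the paper's title.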
Problem

Research questions and friction points this paper is trying to address.

Understand how neural scaling laws arise in a tractable linear regression setting
Resolve the tension with bias-variance analysis, which predicts variance error grows with model size
Characterize the components of the reducible part of the test error
Innovation

Methods, ideas, or system contributions that make the work stand out.

Infinite-dimensional linear regression analysis with a power-law data spectrum
One-pass SGD training on sketched covariates
Shows the implicit regularization of SGD suppresses the variance error
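The ingredients listed above can be put together in a short simulation. The sketch below is an illustrative reconstruction, not the paper's exact construction: the dimensions, step size `lr`, Gaussian sketch `S`, and noiseless responses are all assumed choices.

```python
# Illustrative simulation (assumed setup, not the paper's exact one): one-pass
# SGD on linear regression with sketched covariates and a power-law spectrum.
import numpy as np

rng = np.random.default_rng(0)
D, M, N, a = 512, 64, 20_000, 2.0                # ambient dim, model size, samples, degree

lam = np.arange(1, D + 1, dtype=float) ** (-a)   # power-law eigenvalues lambda_j = j^{-a}
w_star = rng.normal(size=D)                      # ground truth, matching the Gaussian prior
S = rng.normal(size=(M, D)) / np.sqrt(D)         # random sketch: the model only sees S @ x

w = np.zeros(M)
lr = 0.4
for _ in range(N):                               # one pass: each sample is used exactly once
    x = rng.normal(size=D) * np.sqrt(lam)        # covariate with power-law covariance
    y = x @ w_star                               # noiseless response, for simplicity
    z = S @ x                                    # sketched covariate fed to the model
    w += lr * (y - z @ w) * z                    # SGD step on the squared loss

# Monte-Carlo estimate of the test error of the trained sketched model
X_test = rng.normal(size=(4_000, D)) * np.sqrt(lam)
err = np.mean((X_test @ w_star - (X_test @ S.T) @ w) ** 2)
print(f"estimated test error: {err:.4f}")
```

Rerunning with larger M (and enough data) should show the polynomial decay the theory predicts; with M fixed, increasing N eventually saturates at the sketching (approximation) error.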