How Uniform Random Weights Induce Non-uniform Bias: Typical Interpolating Neural Networks Generalize with Narrow Teachers

📅 2024-02-09

🏛️ International Conference on Machine Learning

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

Problem: Why do overparameterized neural networks generalize well despite reaching zero training loss (perfect interpolation), even when the interpolating weights are simply sampled from a uniform prior rather than found by SGD? Method: Using tools from statistical learning theory, function-space analysis, and high-dimensional probability, we characterize how structural redundancy in the network turns a uniform *parameter-space prior* into a non-uniform *function-space prior* that is biased toward low-complexity functions. Contribution: We prove that, when a narrow teacher network agrees with the labels, the generalization of a typical random interpolator is governed by the number of *non-redundant parameters* in the teacher network, not by the size of the student network. The resulting upper bound on sample complexity is approximately proportional to the teacher's complexity. This also helps explain why random interpolators generalize about as well as SGD solutions: the key mechanism is an *implicit functional prior* that emerges from architecture-induced redundancy.
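
The claim that a uniform parameter prior induces a non-uniform function prior can be checked by brute force on a toy model. The sketch below is not the paper's construction: it assumes a hypothetical two-layer sign network on two ±1 inputs, with every weight drawn from {-1, 0, 1} and no biases, and simply counts how many parameter settings realize each truth table. The names `sign`, `network`, and `function_prior` are illustrative.

```python
import itertools
from collections import Counter


def sign(z):
    # Sign activation; a pre-activation of exactly zero is mapped to -1.
    return 1 if z > 0 else -1


def network(x, w_hidden, w_out):
    # Two-layer network with sign units and no biases (toy assumption).
    hidden = [sign(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    return sign(sum(v * h for v, h in zip(w_out, hidden)))


def function_prior(width=2, n_inputs=2, values=(-1, 0, 1)):
    """Count how many parameter settings realize each truth table."""
    inputs = list(itertools.product((-1, 1), repeat=n_inputs))
    n_params = width * n_inputs + width
    counts = Counter()
    # Enumerate every parameter vector once: this is the uniform prior.
    for theta in itertools.product(values, repeat=n_params):
        w_hidden = [theta[i * n_inputs:(i + 1) * n_inputs] for i in range(width)]
        w_out = theta[width * n_inputs:]
        table = tuple(network(x, w_hidden, w_out) for x in inputs)
        counts[table] += 1
    return counts


if __name__ == "__main__":
    counts = function_prior()
    total = sum(counts.values())
    for table, count in counts.most_common():
        print(table, count, round(count / total, 3))
```

In this toy model, functions that can be realized with many weights set to zero or with interchangeable hidden units collect more parameter settings, so they receive more prior mass even though every parameter vector is equally likely.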

📝 Abstract

Background. A main theoretical puzzle is why over-parameterized Neural Networks (NNs) generalize well when trained to zero loss (i.e., so they interpolate the data). Usually, the NN is trained with Stochastic Gradient Descent (SGD) or one of its variants. However, recent empirical work examined the generalization of a random NN that interpolates the data: the NN was sampled from a seemingly uniform prior over the parameters, conditioned on the NN perfectly classifying the training set. Interestingly, such an NN sample typically generalized as well as SGD-trained NNs. Contributions. We prove that such a random NN interpolator typically generalizes well if there exists an underlying narrow "teacher NN" that agrees with the labels. Specifically, we show that such a "flat" prior over the NN parameterization induces a rich prior over the NN functions, due to the redundancy in the NN structure. In particular, this creates a bias towards simpler functions, which require fewer relevant parameters to represent. This enables learning with a sample complexity approximately proportional to the complexity of the teacher (roughly, the number of non-redundant parameters), rather than the student's.
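
The sampling procedure described in the abstract (draw parameters from a flat prior, keep the draw only if it classifies the training set perfectly) can be sketched as rejection sampling. The code below is a minimal illustration under assumed settings, not the paper's experimental setup: a hypothetical one-unit teacher on three ±1 inputs, a width-3 student with weights in {-1, 0, 1}, and the illustrative helper names `forward`, `sample_params`, `guess_and_check`, and `error_rate`.

```python
import itertools
import random


def sign(z):
    # Sign activation; a pre-activation of exactly zero is mapped to -1.
    return 1 if z > 0 else -1


def forward(x, w_hidden, w_out):
    # Two-layer network with sign units and no biases (toy assumption).
    hidden = [sign(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    return sign(sum(v * h for v, h in zip(w_out, hidden)))


def sample_params(rng, width, n_inputs, values=(-1, 0, 1)):
    # One draw from the uniform prior over quantized parameters.
    w_hidden = [[rng.choice(values) for _ in range(n_inputs)] for _ in range(width)]
    w_out = [rng.choice(values) for _ in range(width)]
    return w_hidden, w_out


def guess_and_check(train, rng, width, n_inputs, max_tries=200_000):
    # Rejection sampling: keep the first draw that fits every training label.
    for _ in range(max_tries):
        params = sample_params(rng, width, n_inputs)
        if all(forward(x, *params) == y for x, y in train):
            return params
    return None  # no interpolator found within the budget


def error_rate(params, teacher, inputs):
    # Fraction of inputs on which the student disagrees with the teacher.
    wrong = sum(forward(x, *params) != forward(x, *teacher) for x in inputs)
    return wrong / len(inputs)


rng = random.Random(0)
n_inputs = 3
teacher = ([[1, 0, 0]], [1])  # hypothetical narrow teacher: one hidden unit
all_inputs = list(itertools.product((-1, 1), repeat=n_inputs))
train = [(x, forward(x, *teacher)) for x in rng.sample(all_inputs, 5)]

student = guess_and_check(train, rng, width=3, n_inputs=n_inputs)
if student is not None:
    print("error over all inputs:", error_rate(student, teacher, all_inputs))
```

Repeating the sampling with different seeds and averaging `error_rate` gives a rough picture of how a typical interpolator behaves in this toy setting; the paper's bounds concern much larger networks and are not reproduced by this sketch.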

Problem

Research questions and friction points this paper is trying to address.

Why over-parameterized neural networks generalize well when trained to zero loss

Uniform random weights induce bias

Narrow teacher enables efficient learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uniform random weights induce bias

Narrow teacher NN enables generalization

Redundant NN structure simplifies functions

🔎 Similar Papers

Bias of Stochastic Gradient Descent or the Architecture: Disentangling the Effects of Overparameterization of Neural Networks