🤖 AI Summary
This work investigates the Bayes-optimal generalization performance of wide, shallow two-layer neural networks near the interpolation threshold, where the number of parameters scales as the sample size (width $k \propto d$ and $n \sim d^2$). Leveraging statistical-physics mean-field analysis, random matrix theory, and Bayesian inference, we identify, for the first time, a discontinuous phase transition for binary weights: as the sampling rate crosses a critical threshold, the system passes from a "universal" phase, where generalization is independent of the weight prior, to a "specialization" phase governed by the prior. This reveals the existence of highly predictive yet potentially optimization-intractable solutions within the interpolation regime. We derive an exact analytical expression for the Bayes-optimal generalization error for arbitrary activation functions, establish two distinct scaling laws (slow decay with the sampling rate $n/d^2$ in the universal phase, and faster, alignment-dominated decay in the specialization phase), and characterize the asymptotic recovery of the teacher weights by the student network in the specialization phase.
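For concreteness, the parameter counting behind the phrase "interpolation threshold" is the following (the proportionality constant $\gamma$ is our notation, not taken from the paper):

$$
k = \gamma d \quad \Longrightarrow \quad \underbrace{kd + k}_{\text{trainable parameters}} = \gamma d^2 + \gamma d = \Theta(d^2), \qquad n \sim d^2,
$$

so the number of trainable parameters and the number of training samples are of the same order, which is exactly the regime analysed here.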
📝 Abstract
We consider a teacher-student model of supervised learning with a fully-trained 2-layer neural network whose width $k$ and input dimension $d$ are large and proportional. We compute the Bayes-optimal generalisation error of the network for any activation function in the regime where the number of training data $n$ scales quadratically with the input dimension, i.e., around the interpolation threshold where the number of trainable parameters $kd+k$ and of data points $n$ are comparable. Our analysis tackles generic weight distributions. Focusing on binary weights, we uncover a discontinuous phase transition separating a "universal" phase from a "specialisation" phase. In the first, the generalisation error is independent of the weight distribution and decays slowly with the sampling rate $n/d^2$, with the student learning only some non-linear combinations of the teacher weights. In the latter, the error is weight distribution-dependent and decays faster due to the alignment of the student towards the teacher network. We thus unveil the existence of a highly predictive solution near interpolation, which is however potentially hard to find.
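To make the setting concrete, below is a minimal sketch (ours, not the authors' code) of the teacher-student data-generating process, assuming binary $\pm 1$ teacher weights, Gaussian inputs, and a standard $1/\sqrt{d}$, $1/\sqrt{k}$ normalisation; the variable names, the `tanh` activation, and the noise level are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Proportional regime: width k scales linearly with the input dimension d,
# and the sample size n scales quadratically, n = alpha * d^2.
d = 100          # input dimension (illustrative size)
k = d            # hidden width, proportional to d
alpha = 2.0      # sampling rate n / d^2
n = int(alpha * d**2)

# Teacher: a fully-connected 2-layer network with binary (+/-1) weights,
# kd first-layer weights plus k second-layer weights.
W_star = rng.choice([-1.0, 1.0], size=(k, d))   # first-layer weights
a_star = rng.choice([-1.0, 1.0], size=k)        # second-layer weights

def two_layer(X, W, a, activation=np.tanh):
    """2-layer network: f(x) = a . activation(W x / sqrt(d)) / sqrt(k)."""
    return activation(X @ W.T / np.sqrt(d)) @ a / np.sqrt(k)

# Training data: Gaussian inputs, labels from the teacher (optionally noisy).
X = rng.standard_normal((n, d))
noise_std = 0.1                      # illustrative label-noise level
y = two_layer(X, W_star, a_star) + noise_std * rng.standard_normal(n)
```

The paper's object of study is the Bayes-optimal posterior over the teacher weights given such data, not any particular training algorithm; the sketch only fixes the data distribution that this posterior conditions on.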