🤖 AI Summary
This work investigates the Bayes-optimal generalization performance of wide, shallow two-layer neural networks near the interpolation threshold, where the number of parameters scales as the sample size (width $k \propto d$ and $n \sim d^2$). Leveraging statistical-physics mean-field analysis, random matrix theory, and Bayesian inference, we identify, for the first time, a discontinuous phase transition for binary weights: as the sampling rate crosses a critical threshold, the system passes from a "universal" phase, where generalization is independent of the weight prior, to a "specialization" phase governed by the prior. This reveals the existence of highly predictive yet potentially optimization-intractable solutions within the interpolation regime. We derive an exact analytical expression for the Bayes-optimal generalization error for arbitrary activation functions, establish two distinct scaling laws (slow decay with the sampling rate $n/d^2$ in the universal phase, and faster, alignment-dominated decay in the specialization phase), and characterize the asymptotic recovery of the teacher weights by the student network in the specialization phase.
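For concreteness, the parameter counting behind the phrase "interpolation threshold" is the following (the proportionality constant $\gamma$ is our notation, not taken from the paper):

$$
k = \gamma d \quad \Longrightarrow \quad \underbrace{kd + k}_{\text{trainable parameters}} = \gamma d^2 + \gamma d = \Theta(d^2), \qquad n \sim d^2,
$$

so the number of trainable parameters and the number of training samples are of the same order, which is exactly the regime analysed here.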
📝 Abstract
We consider a teacher-student model of supervised learning with a fully-trained 2-layer neural network whose width $k$ and input dimension $d$ are large and proportional. We compute the Bayes-optimal generalisation error of the network for any activation function in the regime where the number of training data $n$ scales quadratically with the input dimension, i.e., around the interpolation threshold where the number of trainable parameters $kd+k$ and of data points $n$ are comparable. Our analysis tackles generic weight distributions. Focusing on binary weights, we uncover a discontinuous phase transition separating a "universal" phase from a "specialisation" phase. In the first, the generalisation error is independent of the weight distribution and decays slowly with the sampling rate $n/d^2$, with the student learning only some non-linear combinations of the teacher weights. In the latter, the error is weight distribution-dependent and decays faster due to the alignment of the student towards the teacher network. We thus unveil the existence of a highly predictive solution near interpolation, which is however potentially hard to find.
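To make the setting concrete, below is a minimal sketch (ours, not the authors' code) of the teacher-student data-generating process, assuming binary $\pm 1$ teacher weights, Gaussian inputs, and a standard $1/\sqrt{d}$, $1/\sqrt{k}$ normalisation; the variable names, the `tanh` activation, and the noise level are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Proportional regime: width k scales linearly with the input dimension d,
# and the sample size n scales quadratically, n = alpha * d^2.
d = 100          # input dimension (illustrative size)
k = d            # hidden width, proportional to d
alpha = 2.0      # sampling rate n / d^2
n = int(alpha * d**2)

# Teacher: a fully-connected 2-layer network with binary (+/-1) weights,
# kd first-layer weights plus k second-layer weights.
W_star = rng.choice([-1.0, 1.0], size=(k, d))   # first-layer weights
a_star = rng.choice([-1.0, 1.0], size=k)        # second-layer weights

def two_layer(X, W, a, activation=np.tanh):
    """2-layer network: f(x) = a . activation(W x / sqrt(d)) / sqrt(k)."""
    return activation(X @ W.T / np.sqrt(d)) @ a / np.sqrt(k)

# Training data: Gaussian inputs, labels from the teacher (optionally noisy).
X = rng.standard_normal((n, d))
noise_std = 0.1                      # illustrative label-noise level
y = two_layer(X, W_star, a_star) + noise_std * rng.standard_normal(n)
```

The paper's object of study is the Bayes-optimal posterior over the teacher weights given such data, not any particular training algorithm; the sketch only fixes the data distribution that this posterior conditions on.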