Optimal generalisation and learning transition in extensive-width shallow neural networks near interpolation

📅 2025-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the optimal generalization performance of wide, shallow two-layer neural networks near the interpolation threshold, where the number of parameters scales with the sample size (width $k \propto d$ and $n \sim d^2$). Leveraging statistical-physics mean-field analysis, random matrix theory, and Bayesian inference, we identify, for the first time, a discontinuous phase transition under binary weights: as the signal-to-noise ratio crosses a critical threshold, the system passes from a "universal" phase, where generalization is independent of the weight prior, to a "specialized" phase governed by the prior. This reveals the existence of highly predictive yet potentially optimization-intractable solutions within the interpolation regime. We derive an exact analytical expression for the Bayes-optimal generalization error for arbitrary activation functions, rigorously establish two distinct scaling laws, a slow decay in the sampling rate $n/d^2$ and a faster, alignment-dominated decay, and characterize the asymptotic recovery of the teacher weights by the student network in the specialized phase.
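To make the setup concrete, here is a minimal sketch (our illustration under stated assumptions, not the authors' code) of the teacher network and the proportional-width regime described above; the `tanh` activation, the $1/\sqrt{d}$ and $1/\sqrt{k}$ normalisations, and the ratios `gamma` and `alpha` are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 100               # input dimension
gamma = 0.5           # width ratio: k = gamma * d (extensive width)
k = int(gamma * d)
alpha = 1.0           # sampling rate: n = alpha * d**2 (quadratic data scaling)
n = int(alpha * d**2)

# Binary (Rademacher) teacher weights: the prior for which the paper
# reports the discontinuous universal -> specialisation transition.
W_teacher = rng.choice([-1.0, 1.0], size=(k, d))
a_teacher = rng.choice([-1.0, 1.0], size=k)

def teacher(X, W, a, activation=np.tanh):
    """2-layer teacher labels: y = a . activation(W x / sqrt(d)) / sqrt(k)."""
    return activation(X @ W.T / np.sqrt(d)) @ a / np.sqrt(k)

# Training set of n Gaussian inputs labelled by the teacher; a "student"
# of the same architecture would be trained on (X_train, y_train).
X_train = rng.standard_normal((n, d))
y_train = teacher(X_train, W_teacher, a_teacher)

# Near interpolation: data points comparable to trainable parameters kd + k.
print(f"parameters kd + k = {k * d + k:,}  vs  data points n = {n:,}")
```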

📝 Abstract
We consider a teacher-student model of supervised learning with a fully-trained 2-layer neural network whose width $k$ and input dimension $d$ are large and proportional. We compute the Bayes-optimal generalisation error of the network for any activation function in the regime where the number of training data $n$ scales quadratically with the input dimension, i.e., around the interpolation threshold where the number of trainable parameters $kd+k$ and of data points $n$ are comparable. Our analysis tackles generic weight distributions. Focusing on binary weights, we uncover a discontinuous phase transition separating a "universal" phase from a "specialisation" phase. In the first, the generalisation error is independent of the weight distribution and decays slowly with the sampling rate $n/d^2$, with the student learning only some non-linear combinations of the teacher weights. In the latter, the error is weight distribution-dependent and decays faster due to the alignment of the student towards the teacher network. We thus unveil the existence of a highly predictive solution near interpolation, which is however potentially hard to find.
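To see why $n \sim d^2$ sits at the interpolation threshold, here is a one-line parameter count using the abstract's own quantities, with $k = \gamma d$ for a fixed ratio $\gamma > 0$ (our illustration):

$$kd + k = \gamma d^2 + \gamma d = \Theta(d^2),$$

so taking $n$ proportional to $d^2$ makes the number of data points comparable to the number of trainable parameters, which is precisely the "near interpolation" regime of the title.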
Problem

Research questions and friction points this paper is trying to address.

Wide Shallow Neural Networks
Weight Distributions
Phase Transitions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Teacher-Student Model
Discontinuous Phase Transition
Binary Weights
Jean Barbier
Associate Professor, International Center for Theoretical Physics
high-dimensional statistics, machine learning, information theory, spin glasses, random matrices
Francesco Camilli
The Abdus Salam International Centre for Theoretical Physics (ICTP), Strada Costiera 11, 34151 Trieste, Italy
Minh-Toan Nguyen
The Abdus Salam International Centre for Theoretical Physics (ICTP), Strada Costiera 11, 34151 Trieste, Italy
Mauro Pastore
The Abdus Salam International Centre for Theoretical Physics (ICTP), Strada Costiera 11, 34151 Trieste, Italy
Rudy Skerk
International School for Advanced Studies (SISSA), Via Bonomea 265, 34136 Trieste, Italy