🤖 AI Summary
This paper investigates the training and generalization performance of two-layer neural networks trained via one-step gradient descent under Gaussian mixture model (GMM) data, moving beyond conventional isotropic assumptions to account for the structured covariance of real-world data.
Method: In the asymptotic regime where the input dimension $d$, hidden-layer width $m$, and sample size $n$ scale proportionally, the authors combine Gaussian universality, random matrix theory, and asymptotic statistical inference to rigorously analyze the learned model.
Contribution/Results: The paper establishes, for the first time, that the one-step trained network is asymptotically equivalent to a polynomial kernel machine, whose degree is jointly determined by the data's covariance structure (its "spread") and the learning rate. The theoretical characterization precisely captures the asymptotic error behavior. This polynomial equivalence holds robustly across regression and classification tasks in simulation, and Fashion-MNIST experiments indicate that it transfers to real image classification, enhancing model interpretability and offering principled guidance for architecture design.
📝 Abstract
In this work, we study the training and generalization performance of two-layer neural networks (NNs) after one gradient descent step under structured data modeled by Gaussian mixtures. While previous research has extensively analyzed this model under the isotropic data assumption, such simplifications overlook the complexities inherent in real-world datasets. Our work addresses this limitation by analyzing two-layer NNs under a Gaussian mixture data assumption in the asymptotically proportional limit, where the input dimension, number of hidden neurons, and sample size grow with finite ratios. We characterize the training and generalization errors by leveraging recent advancements in Gaussian universality. Specifically, we prove that a high-order polynomial model performs equivalently to the nonlinear neural networks under certain conditions. The degree of the equivalent model is intricately linked to both the "data spread" and the learning rate employed during the one gradient step. Through extensive simulations, we demonstrate the equivalence between the original model and its polynomial counterpart across various regression and classification tasks. Additionally, we explore how different properties of Gaussian mixtures affect learning outcomes. Finally, we illustrate experimental results on Fashion-MNIST classification, indicating that our findings can translate to realistic data.
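To make the setting concrete, here is a minimal NumPy sketch of the pipeline the abstract describes: a two-layer network whose first layer takes a single gradient-descent step under two-component Gaussian mixture data, compared against a low-degree polynomial kernel machine of the kind the paper proves equivalent. All specifics below (dimensions, `tanh` activation, mixture parameters, ridge regularization, learning rate) are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy proportional regime: input dim d, hidden width m, sample size n comparable
n, d, m = 400, 30, 200

# Two-component Gaussian mixture: labels y = +/-1, class means +/- mu, identity covariance
mu = rng.normal(size=d)
mu /= np.linalg.norm(mu)
y = rng.choice([-1.0, 1.0], size=n)
X = y[:, None] * mu + rng.normal(size=(n, d))

# Two-layer NN f(x) = a . sigma(W x); second layer a frozen during the step
W = rng.normal(size=(m, d)) / np.sqrt(d)
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)
sigma, dsigma = np.tanh, lambda z: 1.0 - np.tanh(z) ** 2

# ONE gradient-descent step on the first layer (squared loss)
eta = 2.0                                   # learning rate; per the paper, it shapes the equivalent degree
Z = X @ W.T                                 # pre-activations, shape (n, m)
resid = sigma(Z) @ a - y                    # residuals of the random-features predictor
grad_W = ((dsigma(Z) * np.outer(resid, a)).T @ X) / n
W1 = W - eta * grad_W                       # first layer after one step

# Ridge-fit the second layer on the updated features, then test on fresh data
F = sigma(X @ W1.T)
a1 = np.linalg.solve(F.T @ F + 1e-2 * np.eye(m), F.T @ y)
y_te = rng.choice([-1.0, 1.0], size=n)
X_te = y_te[:, None] * mu + rng.normal(size=(n, d))
acc = np.mean(np.sign(sigma(X_te @ W1.T) @ a1) == y_te)

# Polynomial kernel machine (degree k) for comparison, fit by kernel ridge
k = 1
K = (1.0 + X @ X.T / d) ** k
alpha = np.linalg.solve(K + 1e-2 * np.eye(n), y)
acc_poly = np.mean(np.sign(((1.0 + X_te @ X.T / d) ** k) @ alpha) == y_te)

print(f"one-step NN test accuracy:          {acc:.2f}")
print(f"poly kernel (k={k}) test accuracy:  {acc_poly:.2f}")
```

At this toy scale the two test accuracies are only suggestive; the paper's equivalence is an asymptotic statement as $d$, $m$, and $n$ grow proportionally, with the matching polynomial degree determined by the data spread and the learning rate.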