Neural Networks Learn Generic Multi-Index Models Near Information-Theoretic Limit

📅 2025-11-19
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This paper investigates efficient learning of the hidden subspace in two-layer neural networks under the Gaussian multi-index model. Addressing the open question of whether standard gradient descent can achieve agnostic representation learning at information-theoretically optimal complexity, we provide the first affirmative answer: layer-wise gradient descent recovers the low-dimensional hidden subspace with only $\widetilde{\mathcal{O}}(d)$ samples and $\widetilde{\mathcal{O}}(d^2)$ time, attaining $o_d(1)$ test error and matching the leading-order information-theoretic limits in both sample and computational complexity. Our key technical contribution is the identification that first-layer training requires a super-constant number of iterations to overcome a fundamental performance bottleneck. Through a power-iteration analysis, we rigorously characterize how gradient descent implicitly performs spectral initialization and noise suppression, thereby enabling provably optimal subspace recovery.

Technology Category

Application Category

๐Ÿ“ Abstract
In deep learning, a central issue is to understand how neural networks efficiently learn high-dimensional features. To this end, we explore gradient descent learning of a general Gaussian multi-index model $f(\boldsymbol{x})=g(\boldsymbol{U}\boldsymbol{x})$ with hidden subspace $\boldsymbol{U}\in \mathbb{R}^{r\times d}$, which is the canonical setup for studying representation learning. We prove that, under generic non-degeneracy assumptions on the link function, a standard two-layer neural network trained via layer-wise gradient descent can agnostically learn the target with $o_d(1)$ test error using $\widetilde{\mathcal{O}}(d)$ samples and $\widetilde{\mathcal{O}}(d^2)$ time. The sample and time complexity both align with the information-theoretic limit up to leading order and are therefore optimal. For the first stage of gradient descent learning, the proof proceeds by showing that the inner weights perform a power-iteration process. This process implicitly mimics a spectral start for the whole span of the hidden subspace and eventually eliminates finite-sample noise, recovering this span. It surprisingly indicates that optimal results can be achieved only if the first layer is trained for more than $\mathcal{O}(1)$ steps. This work demonstrates the ability of neural networks to learn hierarchical functions effectively with respect to both sample and time efficiency.
Problem

Research questions and friction points this paper is trying to address.

Can neural networks learn high-dimensional features efficiently from limited data?
Can two-layer networks achieve optimal sample complexity for multi-index models?
Does gradient descent implicitly perform spectral analysis to recover hidden subspaces?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-layer neural network with gradient descent
Power-iteration process for hidden subspace recovery
Optimal sample and time complexity alignment
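The power-iteration picture above can be illustrated with a small numerical sketch. Everything here is hypothetical and not the paper's actual estimator or training procedure: we use a quadratic link $g$, a label-weighted moment matrix $\frac{1}{n}\sum_i y_i x_i x_i^\top - \bar{y}\, I$ (which concentrates on the hidden subspace for links with a nonzero second Hermite coefficient), and plain power iteration with QR orthonormalization standing in for the implicit spectral dynamics of gradient descent on the first layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 50, 2, 20000  # ambient dimension, subspace dimension, samples

# Hypothetical target f(x) = g(Ux) with a random r-dimensional hidden subspace.
U, _ = np.linalg.qr(rng.standard_normal((d, r)))
U = U.T  # rows of U span the hidden subspace, shape (r, d)

def g(z):
    # Illustrative non-degenerate link: quadratic in the hidden directions.
    return (z ** 2).sum(axis=1)

X = rng.standard_normal((n, d))
y = g(X @ U.T)

# Spectral surrogate: empirical E[y (xx^T - I)]. For this link it equals
# 2 U^T U in expectation, so its top-r eigenspace is the hidden subspace.
M = (X.T * y) @ X / n - y.mean() * np.eye(d)

# Power iteration: repeated multiply + QR orthonormalization. This mimics
# the spectral-start-then-noise-suppression dynamics described above.
V = rng.standard_normal((d, r))
for _ in range(50):
    V, _ = np.linalg.qr(M @ V)

# Alignment between the recovered span and the true subspace (1 = perfect).
alignment = np.linalg.norm(U @ V) / np.sqrt(r)
print(f"subspace alignment: {alignment:.3f}")
```

Note the matching intuition: a single multiplication by `M` already biases `V` toward the hidden subspace (a spectral start), but several iterations are needed to suppress the finite-sample noise in `M`, echoing the abstract's point that more than $\mathcal{O}(1)$ first-layer steps are required for optimal recovery.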
🔎 Similar Papers
No similar papers found.