🤖 AI Summary
This paper investigates efficient learning of the hidden subspace in two-layer neural networks under the Gaussian multi-index model. Addressing the open question of whether standard gradient descent can achieve agnostic representation learning at information-theoretically optimal complexity, we provide the first affirmative answer: layer-wise gradient descent recovers the low-dimensional hidden subspace with only $\widetilde{\mathcal{O}}(d)$ samples and $\widetilde{\mathcal{O}}(d^2)$ time, attaining $o_d(1)$ test error and matching the leading-order information-theoretic limits in both sample and computational complexity. Our key technical contribution is the identification that first-layer training requires a super-constant number of iterations to overcome a fundamental performance bottleneck. Through a power-iteration analysis, we rigorously characterize how gradient descent implicitly performs spectral initialization and noise suppression, thereby enabling provably optimal subspace recovery.
📝 Abstract
In deep learning, a central issue is to understand how neural networks efficiently learn high-dimensional features. To this end, we study gradient descent learning of a general Gaussian multi-index model $f(\boldsymbol{x})=g(\boldsymbol{U}\boldsymbol{x})$ with hidden subspace $\boldsymbol{U}\in\mathbb{R}^{r\times d}$, which is the canonical setup for studying representation learning. We prove that under generic non-degeneracy assumptions on the link function, a standard two-layer neural network trained via layer-wise gradient descent can agnostically learn the target with $o_d(1)$ test error using $\widetilde{\mathcal{O}}(d)$ samples and $\widetilde{\mathcal{O}}(d^2)$ time. Both the sample and time complexity match the information-theoretic limit up to leading order and are therefore optimal. For the first stage of gradient descent learning, the proof proceeds by showing that the inner weights perform a power-iteration process. This process implicitly mimics a spectral start for the whole span of the hidden subspace and eventually eliminates finite-sample noise, recovering this span. Surprisingly, this indicates that the optimal result is achieved only if the first layer is trained for more than $\mathcal{O}(1)$ steps. This work demonstrates the ability of neural networks to learn hierarchical functions effectively in terms of both sample and time efficiency.
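The power-iteration mechanism described in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's construction: the "signal plus noise" matrix `M`, the noise scale, and the iteration count below are hypothetical stand-ins for the empirical gradient matrix that first-layer training effectively multiplies by. The sketch only shows why repeating the multiplication a super-constant number of times amplifies the rank-$r$ signal and suppresses finite-sample noise, recovering the span of the hidden subspace from a random (non-spectral) start.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 200, 3

# Hidden subspace U in R^{r x d} with orthonormal rows (ground truth).
U = np.linalg.qr(rng.standard_normal((d, r)))[0].T

# Hypothetical stand-in for an empirical matrix seen by gradient descent:
# a rank-r signal U^T U plus small symmetric "finite-sample" noise.
G = rng.standard_normal((d, d))
M = U.T @ U + 0.05 * (G + G.T) / np.sqrt(d)

# Subspace (orthogonalized power) iteration from a random start; each step
# multiplies by M and re-orthonormalizes, sharpening the subspace estimate.
W = rng.standard_normal((d, r))
for _ in range(20):
    W = np.linalg.qr(M @ W)[0]

# Smallest singular value of U @ W measures how much of the full r-dim
# span is recovered (1.0 = perfect recovery of the whole span).
alignment = float(np.linalg.svd(U @ W, compute_uv=False).min())
print(alignment > 0.9)
```

A constant number of steps leaves a noise-level error in the estimate; iterating until the ratio (noise spectral norm / signal eigenvalue) is powered down is what drives the alignment toward 1, mirroring the abstract's point that more than $\mathcal{O}(1)$ first-layer steps are needed.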