🤖 AI Summary
In two-layer fully connected networks, a single gradient descent step on the first layer with a constant learning rate captures only the linear component of the target function; nonlinear features are not learned.
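To make "the linear component" concrete, one standard formalization (our notation; the summary itself fixes no symbols) expands a single-index target in Hermite polynomials, and the constant-step spike can then align only with the degree-one term:

```latex
% Illustrative only: single-index target expanded in Hermite polynomials h_k,
% with beta a planted unit direction (all notation assumed, not from the summary).
\[
  f_*(x) \;=\; \sum_{k \ge 1} c_k\, h_k\!\big(\langle \beta, x \rangle\big)
  \;=\; \underbrace{c_1\, \langle \beta, x \rangle}_{\text{linear component}}
  \;+\; \underbrace{\sum_{k \ge 2} c_k\, h_k\!\big(\langle \beta, x \rangle\big)}_{\text{non-linear components}} .
\]
% With a constant learning rate, the lone spike carries only the c_1 term.
```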
Method: We propose a learning rate that grows with the sample size, which lets a single gradient update learn polynomial nonlinearities. Leveraging high-dimensional random matrix theory and spectral analysis, we rigorously characterize the multiple rank-one "spikes" that this large step introduces into the feature matrix, each corresponding to a distinct polynomial order, and quantify how each contributes to the training and test errors.
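As a schematic of the mechanism (the model and scalings below are our assumptions, not statements taken from the summary), the single large step and the resulting multi-spike structure can be sketched as:

```latex
% Schematic sketch, not the paper's theorem. Two-layer network with first-layer
% weights W_0, frozen second layer a, and one full-batch step of size eta_n:
\[
  f(x) = \tfrac{1}{\sqrt{N}}\, a^{\top} \sigma(W x), \qquad
  W_1 = W_0 - \eta_n\, \nabla_W \widehat{L}(W_0), \qquad
  \eta_n \to \infty \ \text{as}\ n \to \infty .
\]
% Heuristically, the update then decomposes into rank-one terms, one per
% polynomial order k, each a "spike" aligned with the planted direction beta:
\[
  W_1 \;\approx\; W_0 \;+\; \sum_{k \ge 1} \theta_k\, u_k\, \beta^{\top},
\]
% where theta_k depends on eta_n and n, and the order-k spike carries h_k.
```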
Results: We prove that incorporating these nonlinear features reduces both the limiting training and test errors, going beyond what constant-step, effectively linearized analyses can capture. Our framework provides a rigorous characterization of early-stage nonlinear feature learning in two-layer networks, clarifying how a single large gradient step can rapidly acquire structured polynomial representations.
📝 Abstract
Feature learning is thought to be one of the fundamental reasons for the success of deep neural networks. It is rigorously known that, in two-layer fully-connected neural networks under certain conditions, one step of gradient descent on the first layer can lead to feature learning, characterized by the appearance of a separated rank-one component -- spike -- in the spectrum of the feature matrix. However, with a constant gradient descent step size, this spike carries information only from the linear component of the target function, so learning the non-linear components is impossible. We show that with a learning rate that grows with the sample size, such training in fact introduces multiple rank-one components, each corresponding to a specific polynomial feature. We further prove that the limiting training and test errors of the updated neural networks, in the joint large-dimension and large-sample limit, are fully characterized by these spikes. By precisely analyzing the improvement in the training and test errors, we demonstrate that these non-linear features can enhance learning.
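A minimal numerical sketch of the phenomenon, under assumptions of our own (tanh activation, a frozen Rademacher second layer, a degree-1-to-3 Hermite target along a planted direction, and an illustrative step size of order sqrt(n); none of these specifics come from the abstract): take one full-batch gradient step on the first layer and compare the singular values of the weight matrix before and after.

```python
# Toy sketch (our construction, not the paper's exact setup): one large gradient
# step on the first layer, then inspect the spectrum for outlier "spikes".
import numpy as np

rng = np.random.default_rng(0)
n, d, N = 8000, 400, 600          # samples, input dim, hidden width (illustrative)

X = rng.standard_normal((n, d))                   # inputs x_i ~ N(0, I_d)
beta = rng.standard_normal(d)
beta /= np.linalg.norm(beta)                      # planted unit direction
z = X @ beta
y = z + 0.5 * (z**2 - 1) + 0.3 * (z**3 - 3 * z)   # Hermite components, orders 1-3

W0 = rng.standard_normal((N, d)) / np.sqrt(d)     # first-layer initialization
a = rng.choice([-1.0, 1.0], size=N) / np.sqrt(N)  # frozen second layer

H = np.tanh(X @ W0.T)                             # hidden activations, shape (n, N)
resid = H @ a - y                                 # residuals of the initial network
# Gradient of the squared loss w.r.t. W0 (second layer held fixed):
G = ((resid[:, None] * (1.0 - H**2)) * a[None, :]).T @ X / n

eta = np.sqrt(n)                                  # step growing with n (assumed scaling)
W1 = W0 - eta * G

# Outlier singular values of W1 relative to W0's bulk are the rank-one spikes.
print("top singular values of W0:", np.round(np.linalg.svd(W0, compute_uv=False)[:5], 3))
print("top singular values of W1:", np.round(np.linalg.svd(W1, compute_uv=False)[:5], 3))
```

The paper's analysis pins down exactly which singular values separate and at what scale; the printout above is only meant to show the qualitative gap between the bulk spectrum of W0 and the outliers of W1.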