Global Convergence and Rich Feature Learning in $L$-Layer Infinite-Width Neural Networks under $\mu$P Parametrization

📅 2025-03-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses how deep neural networks achieve global convergence and rich feature learning in the infinite-width limit. Using the tensor program framework and the Maximal Update parametrization, it shows that training with stochastic gradient descent keeps the learned features linearly independent and ensures they capture the relevant information in the data.

📝 Abstract
Despite deep neural networks' powerful representation learning capabilities, theoretical understanding of how networks can simultaneously achieve meaningful feature learning and global convergence remains elusive. Existing approaches like the neural tangent kernel (NTK) are limited because features stay close to their initialization in this parametrization, leaving open questions about feature properties during substantial evolution. In this paper, we investigate the training dynamics of infinitely wide, $L$-layer neural networks using the tensor program (TP) framework. Specifically, we show that, when trained with stochastic gradient descent (SGD) under the Maximal Update parametrization ($\mu$P) and mild conditions on the activation function, SGD enables these networks to learn linearly independent features that substantially deviate from their initial values. This rich feature space captures relevant data information and ensures that any convergent point of the training process is a global minimum. Our analysis leverages both the interactions among features across layers and the properties of Gaussian random variables, providing new insights into deep representation learning. We further validate our theoretical findings through experiments on real-world datasets.
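To make the parametrization in the abstract concrete, the following is a minimal NumPy sketch of a $\mu$P-style MLP initialization. It is an illustration under common $\mu$P conventions, not the paper's exact construction: hidden layers use weight variance $1/\text{fan\_in}$, while the readout layer uses the smaller $1/n$ scale (variance $1/n^2$), which is what lets features move by $O(1)$ during SGD while the network output stays stable as width $n \to \infty$. Full $\mu$P also prescribes width-dependent per-layer learning rates, which this sketch omits.

```python
import numpy as np

def init_mup_mlp(widths, rng):
    """Weights for an L-layer MLP under muP-style scaling (sketch).

    Hidden layers: std 1/sqrt(fan_in), so preactivations are O(1).
    Readout layer: std 1/fan_in (variance 1/n^2), the muP choice that
    keeps outputs stable while features evolve away from initialization.
    """
    Ws = [rng.normal(0.0, 1.0 / np.sqrt(fan_in), (fan_out, fan_in))
          for fan_in, fan_out in zip(widths[:-2], widths[1:-1])]
    n = widths[-2]  # width feeding the readout
    Ws.append(rng.normal(0.0, 1.0 / n, (widths[-1], n)))
    return Ws

def forward(Ws, x):
    """Forward pass with ReLU hidden layers and a linear readout."""
    h = x
    for W in Ws[:-1]:
        h = np.maximum(W @ h, 0.0)  # hidden features
    return Ws[-1] @ h               # readout

rng = np.random.default_rng(0)
Ws = init_mup_mlp([10, 256, 256, 1], rng)  # hypothetical widths
y = forward(Ws, rng.normal(size=10))
```

Under this scaling, the readout output at initialization shrinks as width grows, which is consistent with the feature-learning (rather than kernel/NTK) regime the paper analyzes.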
Problem

Research questions and friction points this paper is trying to address.

Understanding feature learning and global convergence in deep networks
Exploring training dynamics of infinite-width neural networks
Validating theoretical insights with real-world dataset experiments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses the Maximal Update parametrization (μP) for training
Leverages the tensor program framework for analysis
Proves that any convergent point of SGD training is a global minimum, alongside rich feature learning