🤖 AI Summary
This work investigates gradient optimization dynamics in wide, shallow neural networks under asymmetric node scaling—where each hidden node’s output is multiplied by a distinct positive scaling factor—focusing on global convergence and feature learning, in contrast to the classical Neural Tangent Kernel (NTK) parameterization.
Method: We combine dynamical systems analysis, high-probability convergence guarantees, and empirical validation.
Contribution/Results: We establish, for the first time, global convergence guarantees for gradient flow and gradient descent under this non-homogeneous parameterization, proving that, with high probability, optimization converges to a global optimum in the large-width regime. We further show that asymmetric scaling breaks the NTK’s linearization constraint, enabling a form of feature learning. Experiments demonstrate that models trained under this scheme exhibit improved prunability and cross-task transferability relative to NTK-parameterized baselines.
📝 Abstract
We consider gradient-based optimisation of wide, shallow neural networks, where the output of each hidden node is scaled by a positive parameter. The scaling parameters are non-identical, differing from the classical Neural Tangent Kernel (NTK) parameterisation. We prove that, for such neural networks in the large-width regime, with high probability, gradient flow and gradient descent converge to a global minimum and can learn features in some sense, unlike in the NTK parameterisation. We perform experiments illustrating our theoretical results and discuss the benefits of such scaling in terms of prunability and transfer learning.
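The per-node scaling described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's exact construction: the width, ReLU activation, and the particular decaying scale profile are assumptions chosen for concreteness; the paper only requires positive, non-identical scaling factors.

```python
import numpy as np

rng = np.random.default_rng(0)

def shallow_net(x, W, b, scale):
    # f(x) = sum_j scale[j] * b[j] * relu(w_j . x)
    # `scale` holds one positive factor per hidden node.
    return float((scale * b * np.maximum(W @ x, 0.0)).sum())

m, d = 512, 4  # width and input dimension (illustrative sizes)
W = rng.normal(size=(m, d))
b = rng.normal(size=m)

# NTK parameterisation: the same factor 1/sqrt(m) for every node.
ntk_scale = np.full(m, 1.0 / np.sqrt(m))

# Asymmetric scaling: distinct positive factors per node
# (a hypothetical decaying profile, rescaled to comparable overall norm).
asym_scale = 1.0 / np.sqrt(np.arange(1, m + 1))
asym_scale *= np.linalg.norm(ntk_scale) / np.linalg.norm(asym_scale)

x = rng.normal(size=d)
y_ntk = shallow_net(x, W, b, ntk_scale)
y_asym = shallow_net(x, W, b, asym_scale)
```

Under the NTK scaling each node contributes at the same order, which is what drives the lazy-training linearization; with distinct factors, nodes contribute unevenly, which is the regime the paper analyses.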