🤖 AI Summary
This work investigates gradient optimization dynamics in wide, shallow neural networks under asymmetric node scaling—where each hidden node’s output is multiplied by a distinct positive scaling factor—focusing on global convergence and feature learning, in contrast to the classical Neural Tangent Kernel (NTK) parameterization.
Method: We combine dynamical systems analysis, high-probability convergence guarantees, and empirical validation.
Contribution/Results: We establish, for the first time, global convergence guarantees for gradient flow and gradient descent under this non-homogeneous parameterization, proving that, with high probability, optimization converges to a global optimum in the large-width regime. We further show that asymmetric scaling breaks the NTK’s linearization constraint, enabling a form of feature learning. Experiments demonstrate that models trained under this scheme exhibit improved prunability and cross-task transferability relative to NTK-parameterized baselines.
📝 Abstract
We consider gradient-based optimisation of wide, shallow neural networks, where the output of each hidden node is scaled by a positive parameter. The scaling parameters are non-identical, differing from the classical Neural Tangent Kernel (NTK) parameterisation. We prove that, for such neural networks in the large-width regime, with high probability, gradient flow and gradient descent converge to a global minimum and can learn features in some sense, unlike in the NTK parameterisation. We perform experiments illustrating our theoretical results and discuss the benefits of such scaling in terms of prunability and transfer learning.
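The per-node scaling described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's exact construction: the width, ReLU activation, and the particular decaying scale profile are assumptions chosen for concreteness; the paper only requires positive, non-identical scaling factors.

```python
import numpy as np

rng = np.random.default_rng(0)

def shallow_net(x, W, b, scale):
    # f(x) = sum_j scale[j] * b[j] * relu(w_j . x)
    # `scale` holds one positive factor per hidden node.
    return float((scale * b * np.maximum(W @ x, 0.0)).sum())

m, d = 512, 4  # width and input dimension (illustrative sizes)
W = rng.normal(size=(m, d))
b = rng.normal(size=m)

# NTK parameterisation: the same factor 1/sqrt(m) for every node.
ntk_scale = np.full(m, 1.0 / np.sqrt(m))

# Asymmetric scaling: distinct positive factors per node
# (a hypothetical decaying profile, rescaled to comparable overall norm).
asym_scale = 1.0 / np.sqrt(np.arange(1, m + 1))
asym_scale *= np.linalg.norm(ntk_scale) / np.linalg.norm(asym_scale)

x = rng.normal(size=d)
y_ntk = shallow_net(x, W, b, ntk_scale)
y_asym = shallow_net(x, W, b, asym_scale)
```

Under the NTK scaling each node contributes at the same order, which is what drives the lazy-training linearization; with distinct factors, nodes contribute unevenly, which is the regime the paper analyses.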