🤖 AI Summary
This work investigates the convergence of the Gauss–Newton method for training feedforward neural networks with smooth activation functions, giving a unified analysis of fast convergence, without explicit regularization, in both the underparameterized and overparameterized regimes. Leveraging Riemannian optimization and the analysis of embedded submanifolds, we show that the dynamics are equivalent to a Riemannian gradient flow in the underparameterized case and to Levenberg–Marquardt dynamics with adaptive damping in the overparameterized case. We establish, for the first time, exponential convergence of the *last iterate*, and quantitatively characterize how the network scaling factor and the initialization affect the convergence rate. The analysis also establishes robustness to ill-conditioned neural tangent kernels, i.e., kernels with small singular values: in the underparameterized setting, convergence is exponential at a rate independent of the Gram matrix condition number; in the overparameterized setting, stable convergence is guaranteed even under severe ill-conditioning. These results provide a geometric theoretical foundation for efficient deep network training.
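To make the underparameterized case concrete, here is a minimal sketch (not the paper's code) of the discretized Gauss–Newton flow on a toy two-layer tanh network: the update direction is the least-squares solution `pinv(J) @ r`, whose image in output space follows the Riemannian gradient flow described above. The network architecture, scaling factor `alpha`, step size `eta`, and data are illustrative assumptions.

```python
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def net(params, x, alpha=1.0):
    # two-layer network with a smooth (tanh) activation; alpha is the scaling factor
    W1, W2 = params
    return alpha * jnp.tanh(x @ W1) @ W2

key1, key2, key3 = jax.random.split(jax.random.PRNGKey(0), 3)
n, d, h = 64, 4, 3                        # 64 samples, 15 parameters: underparameterized
x = jax.random.normal(key1, (n, d))
y = jnp.sin(x[:, :1])                     # toy regression targets
params = [0.5 * jax.random.normal(key2, (d, h)),
          0.5 * jax.random.normal(key3, (h, 1))]
theta, unravel = ravel_pytree(params)

def residual(theta):
    return (net(unravel(theta), x) - y).ravel()

eta = 0.5                                 # step size of the discretized flow
for _ in range(200):
    r = residual(theta)
    J = jax.jacfwd(residual)(theta)       # n x p Jacobian (p < n)
    delta, *_ = jnp.linalg.lstsq(J, r)    # Gauss-Newton direction: pinv(J) @ r
    theta = theta - eta * delta

print("final loss:", 0.5 * jnp.sum(residual(theta) ** 2))
```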
📝 Abstract
We analyze the convergence of Gauss-Newton dynamics for training neural networks with smooth activation functions. In the underparameterized regime, the Gauss-Newton gradient flow induces a Riemannian gradient flow on a low-dimensional, smooth, embedded submanifold of the Euclidean output space. Using tools from Riemannian optimization, we prove *last-iterate* convergence of the Riemannian gradient flow to the optimal in-class predictor at an *exponential rate* that is independent of the conditioning of the Gram matrix, *without* requiring explicit regularization. We further characterize the critical impacts of the neural network scaling factor and the initialization on the convergence behavior. In the overparameterized regime, we show that the Levenberg-Marquardt dynamics with an appropriately chosen damping factor yields robustness to ill-conditioned kernels, analogous to the underparameterized regime. These findings demonstrate the potential of Gauss-Newton methods for efficiently optimizing neural networks, particularly in ill-conditioned problems where kernel and Gram matrices have small singular values.
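For the overparameterized regime, the Gram matrix `J @ J.T` can have very small singular values, which is where the damping enters. Below is a minimal sketch of a single Levenberg-Marquardt (damped Gauss-Newton) step, reusing a `residual` function like the one sketched above; the fixed damping value `lam` is an illustrative assumption, not the paper's choice of damping factor.

```python
import jax
import jax.numpy as jnp

def lm_step(theta, residual_fn, lam=1e-3, eta=1.0):
    """One damped Gauss-Newton (Levenberg-Marquardt) step in kernel form.

    Solves delta = J^T (J J^T + lam * I)^{-1} r, the cheaper form when the
    number of parameters exceeds the number of residuals, and a solve that
    stays stable when J J^T (the Gram / kernel matrix) has small singular values.
    """
    r = residual_fn(theta)
    J = jax.jacfwd(residual_fn)(theta)            # n x p Jacobian (p > n)
    K = J @ J.T                                   # n x n Gram (kernel) matrix
    u = jnp.linalg.solve(K + lam * jnp.eye(K.shape[0]), r)
    return theta - eta * (J.T @ u)                # damped Gauss-Newton update
```

Setting `lam = 0` recovers the exact Gauss-Newton step `J.T @ solve(J @ J.T, r)`, which is precisely the solve that becomes unstable when the kernel is severely ill-conditioned; the damping term keeps the linear system well posed.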