🤖 AI Summary
To address the convergence of global $\mathcal{L}^2$ loss minimization in supervised deep neural network training, this paper proposes a geometrically adaptive gradient descent framework: the gradient flow is defined with respect to the Euclidean metric in output space, which induces nontrivial Riemannian dynamics in parameter space. Theoretically, under standard overparametrization and rank conditions, the method achieves uniform exponential convergence to the global minimum, and it yields, for the first time, a computable *a priori* stopping time for any prescribed proximity to that minimum. Moreover, it shows that local equilibria can arise only through rank degeneracy and that, generically, they form low-dimensional critical submanifolds of parameter space rather than isolated points. The key contribution is an output-metric-driven geometric perspective on optimization that unifies the characterization of convergence rates, stopping criteria, and critical-point geometry, providing a rigorous geometric foundation for nonconvex optimization in deep learning.
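To illustrate the mechanism behind the exponential rate and the stopping time, here is a brief sketch in our own notation (the symbols $x$, $y$, $D$ are illustrative choices, not taken from the paper). Writing $x(\theta)$ for the stacked network outputs on the training data, $y$ for the targets, and $D = \partial x / \partial \theta$ for the Jacobian, pulling the output-space Euclidean gradient flow back to parameter space gives, whenever $DD^\top$ is invertible (the rank condition),

$$
\dot\theta = -D^\top \left(DD^\top\right)^{-1}\left(x(\theta) - y\right),
\qquad
\dot x = D\dot\theta = -\left(x(\theta) - y\right),
$$

so the cost $\mathcal{C}(t) = \tfrac{1}{2}\|x(\theta(t)) - y\|^2$ decays as $\mathcal{C}(t) = \mathcal{C}(0)\,e^{-2t}$, and $t_* = \tfrac{1}{2}\ln\!\big(\mathcal{C}(0)/\epsilon\big)$ is an *a priori* stopping time for reaching cost $\epsilon$.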
📝 Abstract
We consider the scenario of supervised learning in Deep Learning (DL) networks, and exploit the arbitrariness of choice in the Riemannian metric relative to which the gradient descent flow can be defined (a general fact of differential geometry). In the standard approach to DL, the gradient flow on the space of parameters (weights and biases) is defined with respect to the Euclidean metric. Here instead, we choose the gradient flow with respect to the Euclidean metric in the output layer of the DL network. This naturally induces two modified versions of the gradient descent flow in the parameter space, one adapted for the overparametrized setting, and the other for the underparametrized setting. In the overparametrized case, we prove that, provided that a rank condition holds, all orbits of the modified gradient descent drive the ${\mathcal L}^2$ cost to its global minimum at a uniform exponential convergence rate; one thereby obtains an a priori stopping time for any prescribed proximity to the global minimum. We point out relations of the latter to sub-Riemannian geometry. Moreover, we generalize the above framework to the situation in which the rank condition does not hold; in particular, we show that local equilibria can only exist if a rank loss occurs, and that generically, they are not isolated points, but elements of a critical submanifold of parameter space.
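As a complementary illustration, the following minimal NumPy sketch discretizes the overparametrized flow above with explicit Euler steps; the toy model, finite-difference Jacobian, and step size are our own illustrative assumptions, not the paper's construction.

```python
# Minimal sketch of the output-metric gradient flow: Euler steps of
# d(theta)/dt = -D^T (D D^T)^{-1} (x - y), under which the output obeys
# dx/dt = -(x - y) in continuous time, giving exponential cost decay.
import numpy as np

rng = np.random.default_rng(0)

# Tiny overparametrized regression problem: many more parameters than
# output coordinates, so the Jacobian D can have full row rank.
X = rng.normal(size=(5, 3))        # 5 training inputs, 3 features
y = rng.normal(size=5)             # 5 scalar targets -> output dim 5
theta = rng.normal(size=40) * 0.5  # 40 parameters >> 5 outputs

def forward(theta):
    """One-hidden-layer tanh network; returns the 5 predictions."""
    W1 = theta[:24].reshape(8, 3)
    b1 = theta[24:32]
    w2 = theta[32:40]
    return np.tanh(X @ W1.T + b1) @ w2

def jacobian(theta, eps=1e-6):
    """Finite-difference Jacobian D = d(output)/d(theta), shape (5, 40)."""
    base = forward(theta)
    D = np.empty((base.size, theta.size))
    for j in range(theta.size):
        tp = theta.copy()
        tp[j] += eps
        D[:, j] = (forward(tp) - base) / eps
    return D

dt = 0.1
for step in range(100):
    residual = forward(theta) - y
    D = jacobian(theta)
    # Rank condition: D D^T must be invertible (solve, rather than invert).
    theta -= dt * D.T @ np.linalg.solve(D @ D.T, residual)
    if step % 20 == 0:
        print(f"step {step:3d}  cost = {0.5 * residual @ residual:.3e}")
```

In the continuous-time limit the printed cost would decay exactly like $e^{-2t}$; with the Euler discretization the decay is geometric per step, which is the behavior the a priori stopping time quantifies.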