🤖 AI Summary
This work investigates the global convergence of gradient flow training for infinitely deep, arbitrarily wide residual networks (ResNets), a challenging setting in which the objective function is both nonconvex and noncoercive. We propose a gradient flow model based on the conditional optimal transport (COT) distance, parametrizing the network by probability measures subject to a layerwise constant marginal constraint, and establish its well-posedness in the mean-field limit. We prove that, for suitable initializations, this COT gradient flow converges to a global minimizer for infinitely deep and arbitrarily wide ResNets, the first result of this kind. Moreover, we demonstrate dynamical consistency between this infinite-width limit and the training dynamics of finite-width ResNets. Our analysis integrates tools from Wasserstein gradient flows, differential equations on spaces of probability measures, and Polyak–Łojasiewicz inequality theory, thereby significantly extending the theoretical boundaries of mean-field deep learning.
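The finite-width side of the consistency claim can be illustrated with a toy experiment. The following is a minimal, hypothetical sketch (not the paper's actual model or training procedure): a finite-depth, finite-width ResNet with residual updates `x_{l+1} = x_l + (1/L) W_l σ(x_l)`, trained by plain gradient descent on a squared loss for a single sample. All sizes, the activation, and the learning rate are illustrative choices, and the gradient is computed by finite differences to keep the sketch dependency-free.

```python
import numpy as np

# Hypothetical sketch of a finite-depth, finite-width ResNet:
#   x_{l+1} = x_l + (1/L) * W_l @ sigma(x_l)
# trained by plain gradient descent on a squared loss.
# Depth L, width d, step count, and learning rate are illustrative.

rng = np.random.default_rng(0)
L, d, steps, lr = 8, 4, 100, 0.1   # depth, width, iterations, step size

def sigma(z):
    return np.tanh(z)              # smooth activation

def forward(Ws, x):
    for W in Ws:                   # residual update; the 1/L scaling
        x = x + (1.0 / L) * (W @ sigma(x))  # mimics the deep (ODE) limit
    return x

def loss(Ws, x, y):
    r = forward(Ws, x) - y
    return 0.5 * float(r @ r)

def grad(Ws, x, y, eps=1e-6):
    # forward-difference numerical gradient, entry by entry
    base = loss(Ws, x, y)
    gs = []
    for k in range(len(Ws)):
        g = np.zeros_like(Ws[k])
        for idx in np.ndindex(*Ws[k].shape):
            Ws[k][idx] += eps
            g[idx] = (loss(Ws, x, y) - base) / eps
            Ws[k][idx] -= eps
        gs.append(g)
    return gs

Ws = [0.1 * rng.standard_normal((d, d)) for _ in range(L)]
x0 = rng.standard_normal(d)
y = rng.standard_normal(d)

loss_init = loss(Ws, x0, y)
for _ in range(steps):
    Ws = [W - lr * g for W, g in zip(Ws, grad(Ws, x0, y))]
loss_final = loss(Ws, x0, y)
print(loss_init, loss_final)       # the risk should shrink under descent
```

The mean-field model studied in the paper arises by letting the depth and width of such a network go to infinity; this discrete sketch is only the finite counterpart whose dynamics the infinite limit is shown to be consistent with.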
📄 Abstract
We study the convergence of gradient flow for the training of deep neural networks. While residual neural networks (ResNets) are a popular example of very deep architectures, their training constitutes a challenging optimization problem, notably due to the non-convexity and the non-coercivity of the objective. Yet, in applications, such tasks are successfully solved by simple optimization algorithms such as gradient descent. To better understand this phenomenon, we focus here on a "mean-field" model of an infinitely deep and arbitrarily wide ResNet, parameterized by probability measures on the product set of layers and parameters, with constant marginal on the set of layers. Indeed, in the case of shallow neural networks, mean-field models have been shown to benefit from simplified loss landscapes and good theoretical guarantees when trained with gradient flow w.r.t. the Wasserstein metric on the set of probability measures. Motivated by this approach, we propose to train our model with gradient flow w.r.t. the conditional optimal transport (COT) distance: a restriction of the classical Wasserstein distance that enforces our marginal condition. Relying on the theory of gradient flows in metric spaces, we first show the well-posedness of the gradient flow equation and its consistency with the training of ResNets at finite width. Performing a local Polyak–Łojasiewicz analysis, we then show convergence of the gradient flow for well-chosen initializations: if the number of features is finite but sufficiently large and the risk is sufficiently small at initialization, the gradient flow converges to a global minimizer. This is the first result of this type for infinitely deep and arbitrarily wide ResNets. In addition, this work is an opportunity to study the COT metric in more detail, particularly its dynamic formulation; some of our results in this direction may be of independent interest.
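To make the role of the Polyak–Łojasiewicz (PL) condition concrete, here is the generic shape of the argument, sketched in the metric-space setting. This is a standard schematic, not the paper's actual statement: \(F\) stands for the risk, \(\mu_t\) for the gradient flow curve, \(|\partial F|\) for the metric slope, and the constant \(c>0\) is assumed.

```latex
% Generic PL-to-convergence schematic (illustrative, not the paper's theorem).
% Assume a local PL inequality near the initialization \mu_0:
\begin{align*}
  &|\partial F|^2(\mu)\;\ge\;2c\,\bigl(F(\mu)-\inf F\bigr),
    \qquad c>0.\\
  \intertext{Along the gradient flow, the energy dissipation identity gives}
  &\frac{d}{dt}\bigl(F(\mu_t)-\inf F\bigr)
    \;=\;-\,|\partial F|^2(\mu_t)
    \;\le\;-\,2c\,\bigl(F(\mu_t)-\inf F\bigr),\\
  \intertext{so Gr\"onwall's lemma yields exponential decay of the risk:}
  &F(\mu_t)-\inf F\;\le\;e^{-2ct}\,\bigl(F(\mu_0)-\inf F\bigr).
\end{align*}
```

The "sufficiently small risk at initialization" hypothesis in the abstract is what keeps the flow inside the region where such a local PL inequality can be expected to hold.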