Understanding the training of infinitely deep and wide ResNets with Conditional Optimal Transport

📅 2024-03-19
đŸ›ïž Communications on Pure and Applied Mathematics
📈 Citations: 3
✹ Influential: 0
đŸ€– AI Summary
This work investigates the global convergence of gradient-flow training for infinitely deep, arbitrarily wide residual networks (ResNets), a setting made difficult by an objective that is both nonconvex and noncoercive. The authors model training as a gradient flow with respect to the conditional optimal transport (COT) distance, parameterizing the network by probability measures subject to a constant-marginal constraint across layers, and establish well-posedness of this flow in the mean-field limit. For suitable initializations, they prove that the COT gradient flow converges to a global minimizer, the first result of this type for infinitely deep and arbitrarily wide ResNets. They also show that this infinite-width limit is dynamically consistent with the training of finite-width ResNets. The analysis combines Wasserstein gradient flows, differential equations on spaces of probability measures, and a local Polyak–Ɓojasiewicz inequality, extending the theoretical reach of mean-field deep learning.
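The mean-field model described above can be sketched as an ODE over depth. The notation here is illustrative, chosen to convey the structure rather than to reproduce the paper's exact formulation:

\[
\dot{X}_s \;=\; \int_{\Theta} f\big(X_s, \theta\big)\, \mathrm{d}\mu_s(\theta), \qquad s \in [0,1],
\]

where \(\mu_s\) denotes the conditional distribution over parameters \(\theta \in \Theta\) at depth \(s\), obtained by disintegrating a measure \(\mu\) on \([0,1]\times\Theta\) whose first marginal is fixed to be uniform over depth (the "constant marginal" constraint). Training then follows the gradient flow of the risk with respect to the COT distance, which moves mass only within each depth slice and so preserves this marginal.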

📝 Abstract
We study the convergence of gradient flow for the training of deep neural networks. While residual neural networks (ResNet) are a popular example of very deep architectures, their training constitutes a challenging optimization problem, notably due to the non‐convexity and the non‐coercivity of the objective. Yet, in applications, such tasks are successfully solved by simple optimization algorithms such as gradient descent. To better understand this phenomenon, we focus here on a “mean‐field” model of an infinitely deep and arbitrarily wide ResNet, parameterized by probability measures on the product set of layers and parameters, and with constant marginal on the set of layers. Indeed, in the case of shallow neural networks, mean field models have been proven to benefit from simplified loss landscapes and good theoretical guarantees when trained with gradient flow w.r.t. the Wasserstein metric on the set of probability measures. Motivated by this approach, we propose to train our model with gradient flow w.r.t. the conditional optimal transport (COT) distance: a restriction of the classical Wasserstein distance which enforces our marginal condition. Relying on the theory of gradient flows in metric spaces, we first show the well‐posedness of the gradient flow equation and its consistency with the training of ResNets at finite width. Performing a local Polyak–Ɓojasiewicz analysis, we then show convergence of the gradient flow for well‐chosen initializations: if the number of features is finite but sufficiently large and the risk is sufficiently small at initialization, the gradient flow converges to a global minimizer. This is the first result of this type for infinitely deep and arbitrarily wide ResNets. In addition, this work is an opportunity to study in more detail the COT metric, particularly its dynamic formulation. Some of our results in this direction might be interesting on their own.
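As rough intuition for the finite-width consistency result the abstract mentions, a depth-\(L\) ResNet can be viewed as an Euler discretization (step size 1/L) of a depth ODE. The residual form, activation, and scalings below are illustrative assumptions, not the paper's exact parametrization:

```python
import numpy as np

rng = np.random.default_rng(0)

def resnet_forward(x, params, L):
    """Forward pass of a residual network read as an Euler scheme.

    Each residual block adds (1/L) times an empirical average over m
    features, mimicking the integral against the parameter measure at
    that depth in the mean-field limit.
    """
    for W, V in params:  # W, V: (m, d) per-layer parameter arrays
        m = W.shape[0]
        # Euler step of size 1/L; the 1/m factor is the empirical
        # (mean-field) average over features at this depth.
        x = x + (V.T @ np.tanh(W @ x)) / (m * L)
    return x

d, m, L = 4, 8, 50  # width-m features, depth-L network (illustrative sizes)
params = [(rng.normal(size=(m, d)) / np.sqrt(d),
           rng.normal(size=(m, d)) / np.sqrt(m)) for _ in range(L)]
x0 = np.ones(d)
out = resnet_forward(x0, params, L)
print(out.shape)
```

As L grows with the per-layer step scaled by 1/L, the iterates trace out a depth-continuous trajectory, which is the infinitely deep limit the paper works with.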
Problem

Research questions and friction points this paper is trying to address.

Analyzing gradient flow convergence for infinitely deep ResNets
Addressing nonconvex, noncoercive optimization in deep neural network training
Proving convergence to a global minimizer via the conditional optimal transport distance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Trains a mean-field model of deep ResNets with gradient flow
Applies the conditional optimal transport (COT) distance to enforce the layer-marginal constraint
Establishes convergence through a local Polyak–Ɓojasiewicz analysis
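The Polyak–Ɓojasiewicz step hinges on an inequality of the following form, written here with illustrative notation and constants:

\[
\big\| \nabla_{\mathrm{COT}}\, \mathcal{R}(\mu) \big\|^2 \;\ge\; 2c \left( \mathcal{R}(\mu) - \mathcal{R}^* \right), \qquad c > 0,
\]

where \(\mathcal{R}\) is the risk and \(\mathcal{R}^*\) its infimum. Along the gradient flow \(\partial_t \mu_t = -\nabla_{\mathrm{COT}}\,\mathcal{R}(\mu_t)\), differentiating \(\mathcal{R}(\mu_t)\) and applying the inequality gives exponential decay,

\[
\mathcal{R}(\mu_t) - \mathcal{R}^* \;\le\; e^{-2ct} \left( \mathcal{R}(\mu_0) - \mathcal{R}^* \right),
\]

provided the inequality holds locally along the trajectory, which is why the result requires a sufficiently small risk at initialization.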