Training Infinitely Deep and Wide Transformers

📅 2026-05-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited theoretical understanding of Transformer training dynamics when both depth (number of layers) and width (number of attention heads) tend to infinity. Working in the mean-field regime, the authors build a rigorous framework for gradient-based training: training is cast as the control of a neural partial differential equation (PDE), the model is described by two families of measures (token distributions evolving through the layers, and attention parameters at each layer), and an explicit conditional Wasserstein gradient of the training risk is derived by adjoint sensitivity analysis. The paper proves well-posedness of the forward pass and existence and uniqueness of the gradient flow, gives necessary and sufficient conditions for injectivity of the attention neural tangent kernel (NTK), and shows that, under NTK injectivity, the gradient flow converges to a global minimum when the initial loss is sufficiently small, so that spurious local minima do not obstruct training in that regime.
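The infinite-depth limit described above can be illustrated numerically. The following is a minimal sketch, not the paper's model: it assumes a residual stack of multi-head softmax attention layers with step size 1/(number of layers), parameters shared across layers, and a 1/(number of heads) average over heads as the width scaling. The head matrices, token values, and all function names are hypothetical choices made for illustration.

```python
import math

def rotation(theta, scale):
    """2x2 scaled rotation matrix, used here only as a simple deterministic parameter."""
    c, s = math.cos(theta), math.sin(theta)
    return [[scale * c, -scale * s], [scale * s, scale * c]]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_velocity(tokens, heads):
    """Average over heads of softmax-attention outputs (assumed 1/H width scaling)."""
    n, d = len(tokens), len(tokens[0])
    out = [[0.0] * d for _ in range(n)]
    for Q, K, V in heads:
        queries = [matvec(Q, x) for x in tokens]
        keys = [matvec(K, x) for x in tokens]
        values = [matvec(V, x) for x in tokens]
        for i in range(n):
            weights = softmax([dot(queries[i], k) for k in keys])
            for w, val in zip(weights, values):
                for c in range(d):
                    out[i][c] += w * val[c] / len(heads)
    return out

def forward(tokens, heads, num_layers):
    """Residual layers with step 1/num_layers: forward Euler for dx_i/dt = velocity_i(x) on [0, 1]."""
    step = 1.0 / num_layers
    x = [list(t) for t in tokens]
    for _ in range(num_layers):
        v = attention_velocity(x, heads)
        x = [[a + step * b for a, b in zip(xr, vr)] for xr, vr in zip(x, v)]
    return x

def max_gap(a, b):
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))

# Hypothetical parameters: 8 heads, each a (Q, K, V) triple of scaled rotations.
NUM_HEADS = 8
heads = [
    (rotation(0.0, 0.5),
     rotation(2 * math.pi * h / NUM_HEADS, 0.5),
     rotation(2 * math.pi * h / NUM_HEADS + math.pi / 2, 0.5))
    for h in range(NUM_HEADS)
]
tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5], [0.5, -1.0]]

coarse = forward(tokens, heads, 100)
fine = forward(tokens, heads, 200)
depth_gap = max_gap(coarse, fine)
print("max change in output tokens when doubling depth:", depth_gap)
```

Because every token's velocity depends on all other tokens through the softmax weights, the continuous-depth system couples the whole token distribution, which is the intuition behind describing the limit with a PDE rather than a single ODE.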

📝 Abstract

Transformers have become the dominant architecture in modern machine learning, yet the theoretical understanding of their training dynamics remains limited. This paper develops a rigorous mathematical framework for analyzing gradient-based training of transformers in the mean-field regime, where both the depth (number of layers) and width (number of attention heads) tend to infinity. While ResNet training can be understood as controlling a neural ODE, transformer training corresponds to controlling a neural PDE, due to the coupling of multiple token distributions through the attention mechanism. Our mean-field model features two types of measure representations: token distributions evolving through layers and attention parameters at each layer. We establish well-posedness of the forward pass through infinitely deep transformers, characterizing token evolution via flow maps that satisfy ODEs in function spaces. Using adjoint sensitivity analysis, we derive an explicit formula for the conditional Wasserstein gradient of the training risk, involving adjoint variables governed by backward ODEs. We prove the existence and uniqueness of gradient flow curves in the conditional Wasserstein metric space, establishing a rigorous foundation for gradient-based transformer training. A key technical contribution is providing necessary and sufficient conditions for injectivity of the Neural Tangent Kernel (NTK) for attention mechanisms: we show that NTK injectivity is equivalent to linear independence of log-sum-exp functions modulo affine functions, a condition satisfied by diverse token distributions, including discrete distributions, uniform distributions, and Gaussian mixtures. Under this NTK injectivity assumption, we prove that gradient flow converges to global minima when the initial loss is sufficiently small, eliminating spurious local minima from the optimization landscape.
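One way to see why log-sum-exp functions appear in a condition about attention is the standard identity that the gradient of a weighted log-sum-exp with respect to the query equals the softmax-weighted mean of the tokens. The sketch below checks this identity by finite differences for a small discrete token distribution. It is an illustration under assumptions: the exact log-sum-exp functions used in the paper are not given here, and the tokens, weights, and function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def log_sum_exp(query, tokens, weights):
    """log of sum_j w_j * exp(<query, x_j>), computed stably."""
    scores = [dot(query, x) for x in tokens]
    m = max(scores)
    return m + math.log(sum(w * math.exp(s - m) for w, s in zip(weights, scores)))

def attention_mean(query, tokens, weights):
    """Softmax-weighted mean of the tokens, i.e. an attention output with identity values."""
    scores = [dot(query, x) for x in tokens]
    m = max(scores)
    unnorm = [w * math.exp(s - m) for w, s in zip(weights, scores)]
    total = sum(unnorm)
    return [sum(u * x[c] for u, x in zip(unnorm, tokens)) / total
            for c in range(len(query))]

def finite_difference_gradient(f, point, eps=1e-5):
    """Central-difference approximation of the gradient of f at point."""
    grad = []
    for c in range(len(point)):
        up = list(point)
        down = list(point)
        up[c] += eps
        down[c] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

# Hypothetical discrete token distribution: three tokens with probabilities 0.5, 0.3, 0.2.
tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]
weights = [0.5, 0.3, 0.2]
query = [0.7, -0.4]

numeric = finite_difference_gradient(lambda q: log_sum_exp(q, tokens, weights), query)
exact = attention_mean(query, tokens, weights)
gradient_error = max(abs(a - b) for a, b in zip(numeric, exact))
print("max difference between numeric gradient and attention output:", gradient_error)
```

Adding an affine function of the query to the log-sum-exp only shifts this gradient by a constant vector, which gives some intuition for why a condition on these functions might be stated modulo affine functions.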

Problem

Research questions and friction points this paper is trying to address.

Transformers

mean-field regime

gradient flow

Neural Tangent Kernel

optimization landscape

Innovation

Methods, ideas, or system contributions that make the work stand out.

mean-field limit

neural PDE

conditional Wasserstein gradient