Weight-Decay Turns Transformer Loss Landscapes Villani: Functional-Analytic Foundations for Optimization and Generalization

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work explains how weight decay shapes the loss landscape of Transformers so that both optimization convergence and generalization can be guaranteed. It introduces, for the first time in deep learning, the Villani coercivity condition from functional analysis, and rigorously proves that the L²-regularized Transformer loss satisfies this condition. This establishes an explicit theoretical link among weight decay, the geometric structure of the loss landscape, PAC-Bayes generalization bounds, and the convergence of Langevin dynamics. The analysis combines log-Sobolev inequalities, spectral methods, and Hutchinson trace estimation, yielding a scalable diagnostic quantity Ψₛ(θ). Experiments on GPT-Neo-125M confirm the predicted quadratic growth of Ψₛ, spectral inflation of the Hessian, and exponential convergence, consistent with the claim that weight decay promotes rapid mixing and supports curvature-aware optimization.
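To see informally why the L² term drives the diagnostic to infinity, the following sketch works through the growth of Ψₛ under simplifying assumptions that are not taken from the paper: the regularized loss is written as $\mathcal{F} = \mathcal{L} + \tfrac{\lambda}{2}\|\theta\|^2$ (the factor $\tfrac{1}{2}$ is a convention chosen here), $d$ is taken to be the number of parameters, and the unregularized loss $\mathcal{L}$ is assumed to have bounded gradient, $\|\nabla\mathcal{L}\| \le G$, and bounded Laplacian, $|\Delta\mathcal{L}| \le M$. The paper's actual proof may proceed differently.

$$
\begin{aligned}
\nabla\mathcal{F} &= \nabla\mathcal{L} + \lambda\theta, \qquad \Delta\mathcal{F} = \Delta\mathcal{L} + \lambda d, \\
\Psi_s(\theta) &= -\Delta\mathcal{L} - \lambda d + \tfrac{1}{s}\,\|\nabla\mathcal{L} + \lambda\theta\|^{2} \\
&\ge -M - \lambda d + \tfrac{1}{s}\,\bigl(\lambda\|\theta\| - G\bigr)^{2} \qquad \text{whenever } \lambda\|\theta\| \ge G .
\end{aligned}
$$

Under these assumptions the lower bound grows like $(\lambda^{2}/s)\,\|\theta\|^{2}$, so $\Psi_s(\theta) \to \infty$ as $\|\theta\| \to \infty$ for every $s > 0$, which matches the quadratic growth described in the summary.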

📝 Abstract

Weight decay is widely used as a regularizer in large language models, yet its precise role in shaping Transformer loss landscapes remains theoretically underexplored. This paper provides the first rigorous functional-analytic characterization of the standard Transformer objective (cross-entropy loss with $L^2$ regularization) by proving it satisfies Villani's criteria for coercive energy functions. Specifically, we show that the regularized loss $\mathcal{F}$ is infinitely differentiable, grows at least quadratically, has Gaussian-integrable tails, and satisfies the differential growth condition $-\Delta\mathcal{F} + \tfrac{1}{s}\|\nabla\mathcal{F}\|^{2} \to \infty$ as $\|\theta\| \to \infty$ for all $s>0$. From this structure, we derive explicit log-Sobolev and Poincaré constants $C_{\mathrm{LS}} \leq \lambda^{-1} + d/\lambda^{2}$, linking the regularization strength $\lambda$ and model dimension $d$ to finite-time convergence guarantees for noisy stochastic gradient descent and PAC-Bayesian generalization bounds that tighten with increasing $\lambda$. To validate our theory, we introduce a scalable Villani diagnostic $\Psi_s(\theta) = -\Delta\mathcal{F} + s^{-1}\|\nabla \mathcal{F}\|^2$ and estimate it efficiently using Hutchinson trace probes in models with over 100M parameters. Experiments on GPT-Neo-125M across Penn Treebank and WikiText-103 confirm the predicted quadratic growth of $\Psi_s$, spectral inflation of the Hessian, and exponential convergence behavior consistent with our log-Sobolev analysis. These results demonstrate that weight decay not only improves generalization empirically but also establishes the mathematical conditions required for fast Langevin mixing and theoretically grounded curvature-aware optimization in deep learning.
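The diagnostic $\Psi_s(\theta) = -\Delta\mathcal{F} + s^{-1}\|\nabla\mathcal{F}\|^2$ can be illustrated on a toy problem. The sketch below is not the authors' code: the function names (`hutchinson_laplacian`, `villani_diagnostic`, `make_logistic_grad`) are hypothetical, the model is a small $L^2$-regularized logistic regression rather than a Transformer, and Hessian-vector products are approximated by central finite differences of the gradient instead of the automatic differentiation one would presumably use at the 100M-parameter scale. It relies on the standard Hutchinson identity $\mathrm{tr}(H) = \mathbb{E}[v^\top H v]$ for Rademacher probes $v$.

```python
import numpy as np


def hutchinson_laplacian(grad_fn, theta, n_probes=32, eps=1e-4, seed=0):
    """Estimate the Laplacian of F (trace of its Hessian) at theta.

    Uses Rademacher probes v and the identity tr(H) = E[v^T H v].
    The Hessian-vector product H v is approximated by a central
    finite difference of the gradient (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_probes):
        v = rng.choice([-1.0, 1.0], size=theta.shape)
        hv = (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2.0 * eps)
        total += float(v @ hv)
    return total / n_probes


def villani_diagnostic(grad_fn, theta, s=1.0, **probe_kwargs):
    """Psi_s(theta) = -Laplacian(F) + ||grad F||^2 / s."""
    g = grad_fn(theta)
    laplacian = hutchinson_laplacian(grad_fn, theta, **probe_kwargs)
    return -laplacian + float(g @ g) / s


def make_logistic_grad(X, y, lam):
    """Gradient of mean log(1 + exp(-y * x.theta)) + (lam / 2) * ||theta||^2."""
    def grad_fn(theta):
        margins = y * (X @ theta)
        # sigmoid(-margin), written with tanh for numerical stability
        weights = 0.5 * (1.0 - np.tanh(0.5 * margins))
        return -(X.T @ (y * weights)) / len(y) + lam * theta
    return grad_fn


# Toy data (hypothetical): 200 samples, 5 parameters, labels in {-1, +1}.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = np.sign(rng.normal(size=200))
grad_fn = make_logistic_grad(X, y, lam=0.1)

# Evaluate Psi_s along a fixed unit direction at increasing radii.
direction = np.ones(5) / np.sqrt(5.0)
for radius in (1.0, 10.0, 100.0):
    psi = villani_diagnostic(grad_fn, radius * direction, s=1.0)
    print(f"radius={radius:6.1f}  Psi_s={psi:.4f}")
```

In this toy setting the gradient and Laplacian of the data term are bounded, so the $\lambda^2\|\theta\|^2/s$ contribution dominates at large radius. Each probe costs two gradient evaluations, which is why the probe count, not the parameter count, sets the cost of the estimate.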

Problem

Research questions and friction points this paper is trying to address.

weight decay

loss landscape

Transformer

regularization

generalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

weight decay

Villani criterion

log-Sobolev inequality