🤖 AI Summary
Existing research lacks a theoretical characterization of the LayerNorm and feed-forward network (FFN) Hessians, limiting our understanding of Transformer optimization landscapes. This paper derives, for the first time, closed-form Hessian expressions for the full Transformer block, covering multi-head self-attention, the FFN, and LayerNorm, and establishes a unified framework for analyzing curvature propagation. We propose a joint modeling approach that combines matrix calculus with generalized second-order gradients to precisely quantify each sublayer's contribution to the overall curvature. Furthermore, we develop a Taylor-expansion-based loss-difference analysis to rigorously characterize optimization trajectories. Our theoretical results reveal intrinsic connections between curvature structure and large-model scaling laws, providing the first systematic second-order theoretical foundation for convergence analysis and performance prediction in large-scale Transformers.
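To make the loss-difference analysis concrete, here is a minimal second-order Taylor sketch of the quantity such a framework tracks; the notation ($\theta$, $\Delta\theta$, $H$) is generic illustration and is not taken from the paper itself:

```latex
% Second-order Taylor expansion of the loss difference around parameters \theta.
% \Delta\theta is one parameter update (e.g., a single optimizer step); H(\theta)
% denotes the full-block Hessian assembled from the attention, FFN, and LayerNorm
% sublayers (notation ours, assumed for illustration).
\[
  \Delta L \;=\; L(\theta + \Delta\theta) - L(\theta)
  \;\approx\; \nabla L(\theta)^{\top} \Delta\theta
  \;+\; \tfrac{1}{2}\, \Delta\theta^{\top} H(\theta)\, \Delta\theta ,
\]
% where the linear term gives the usual first-order descent prediction and the
% quadratic term is where sublayer curvature enters the trajectory analysis.
```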
📝 Abstract
The lack of theoretical results for Layer Normalization and feed-forward Hessians has left a gap in the study of Transformer optimization landscapes. We address this by deriving explicit second-order expressions for these components, thereby completing the Hessian characterization of full Transformer blocks. Our results generalize prior self-attention analyses and yield estimates of the role each sublayer plays in curvature propagation. We demonstrate how these Hessian structures inform both convergence dynamics and the empirical scaling laws governing large-model performance. Further, we propose a Taylor-expansion-based framework for analyzing loss differences to quantify convergence trajectories. By extending Hessian theory to the full Transformer architecture, this work establishes a new foundation for theoretical and empirical investigations of optimization in large-scale deep learning.
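As background for the kind of closed-form structure these derivations involve, the following is a standard, self-contained sketch of the LayerNorm input Jacobian (differentiating once more yields the corresponding second-order term); it is textbook material, not a result quoted from the paper:

```latex
% LayerNorm on a d-dimensional activation x (affine scale/shift and the
% numerical-stability constant \epsilon omitted for brevity):
%   \mu = \tfrac{1}{d}\sum_i x_i, \quad
%   \sigma^2 = \tfrac{1}{d}\sum_i (x_i - \mu)^2, \quad
%   y = (x - \mu)/\sigma .
% Its input Jacobian has the well-known rank-structured closed form
\[
  \frac{\partial y}{\partial x}
  \;=\; \frac{1}{\sigma}\left( I
  \;-\; \frac{1}{d}\,\mathbf{1}\mathbf{1}^{\top}
  \;-\; \frac{1}{d}\, y\, y^{\top} \right),
\]
% and its derivative in turn (a third-order tensor) is what enters the LayerNorm
% contribution to the block Hessian through the chain rule.
```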