🤖 AI Summary
Existing research lacks a theoretical characterization of the LayerNorm and feed-forward network (FFN) Hessians, limiting our understanding of Transformer optimization landscapes. This paper derives, for the first time, closed-form Hessian expressions for the full Transformer block, covering multi-head self-attention, the FFN, and LayerNorm, and establishes a unified framework for analyzing curvature propagation. We propose a joint modeling approach that combines matrix calculus with generalized second-order gradients to precisely quantify each sublayer's contribution to the overall curvature. Furthermore, we develop a Taylor-expansion-based loss-difference analysis to rigorously characterize optimization trajectories. Our theoretical results reveal intrinsic connections between curvature structure and large-model scaling laws, providing the first systematic second-order theoretical foundation for convergence analysis and performance prediction in large-scale Transformers.
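To make the loss-difference analysis concrete, here is a minimal second-order Taylor sketch of the quantity such a framework tracks; the notation ($\theta$, $\Delta\theta$, $H$) is generic illustration and is not taken from the paper itself:

```latex
% Second-order Taylor expansion of the loss difference around parameters \theta.
% \Delta\theta is one parameter update (e.g., a single optimizer step); H(\theta)
% denotes the full-block Hessian assembled from the attention, FFN, and LayerNorm
% sublayers (notation ours, assumed for illustration).
\[
  \Delta L \;=\; L(\theta + \Delta\theta) - L(\theta)
  \;\approx\; \nabla L(\theta)^{\top} \Delta\theta
  \;+\; \tfrac{1}{2}\, \Delta\theta^{\top} H(\theta)\, \Delta\theta ,
\]
% where the linear term gives the usual first-order descent prediction and the
% quadratic term is where sublayer curvature enters the trajectory analysis.
```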
📝 Abstract
The lack of theoretical results for Layer Normalization and feed-forward Hessians has left a gap in the study of Transformer optimization landscapes. We address this by deriving explicit second-order expressions for these components, thereby completing the Hessian characterization of full Transformer blocks. Our results generalize prior self-attention analyses and yield estimates of the role each sublayer plays in curvature propagation. We demonstrate how these Hessian structures inform both convergence dynamics and the empirical scaling laws governing large-model performance. Further, we propose a Taylor-expansion-based framework for analyzing loss differences to quantify convergence trajectories. By extending Hessian theory to the full Transformer architecture, this work establishes a new foundation for theoretical and empirical investigations of optimization in large-scale deep learning.
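As background for the kind of closed-form structure these derivations involve, the following is a standard, self-contained sketch of the LayerNorm input Jacobian (differentiating once more yields the corresponding second-order term); it is textbook material, not a result quoted from the paper:

```latex
% LayerNorm on a d-dimensional activation x (affine scale/shift and the
% numerical-stability constant \epsilon omitted for brevity):
%   \mu = \tfrac{1}{d}\sum_i x_i, \quad
%   \sigma^2 = \tfrac{1}{d}\sum_i (x_i - \mu)^2, \quad
%   y = (x - \mu)/\sigma .
% Its input Jacobian has the well-known rank-structured closed form
\[
  \frac{\partial y}{\partial x}
  \;=\; \frac{1}{\sigma}\left( I
  \;-\; \frac{1}{d}\,\mathbf{1}\mathbf{1}^{\top}
  \;-\; \frac{1}{d}\, y\, y^{\top} \right),
\]
% and its derivative in turn (a third-order tensor) is what enters the LayerNorm
% contribution to the block Hessian through the chain rule.
```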