Stability of Transformers under Layer Normalization

📅 2025-10-10

🤖 AI Summary

Problem: Training deep Transformers can be unstable, and the placement of layer normalization (LN) has lacked theoretical guidance. Method: This paper establishes a unified theoretical framework that quantifies how LN placement affects the growth bound of forward hidden states and the evolution of backward gradient norms, and thereby whether optimization is steered toward benign or pathological solutions. Combining theoretical analysis with numerical experiments, the authors show that post-residual LN (Post-LN) suppresses state and gradient explosion, and that residual scaling coefficients must be co-designed with LN placement. They derive a general stability criterion and use it to optimize scaling strategies. Contribution/Results: Empirical validation shows that a principled joint configuration of LN position and scaling significantly improves training stability and final model performance. The framework extends to stability analysis of novel Transformer architectures.

📝 Abstract

Despite their widespread use, training deep Transformers can be unstable. Layer normalization, a standard component, improves training stability, but its placement has often been ad-hoc. In this paper, we conduct a principled study on the forward (hidden states) and backward (gradient) stability of Transformers under different layer normalization placements. Our theory provides key insights into the training dynamics: whether training drives Transformers toward regular solutions or pathological behaviors. For forward stability, we derive explicit bounds on the growth of hidden states in trained Transformers. For backward stability, we analyze how layer normalization affects the backpropagation of gradients, thereby explaining the training dynamics of each layer normalization placement. Our analysis also guides the scaling of residual steps in Transformer blocks, where appropriate choices can further improve stability and performance. Our numerical results corroborate our theoretical findings. Beyond these results, our framework provides a principled way to sanity-check the stability of Transformers under new architectural modifications, offering guidance for future designs.
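
For readers unfamiliar with the terminology, the two most common placements can be written as follows. This is a sketch in standard notation, not the paper's own: $x_\ell$ is the hidden state entering block $\ell$, $F_\ell$ stands for the attention or feed-forward sublayer, and $\alpha$ is an assumed scalar residual step size.

```latex
\begin{align*}
\text{Post-LN:}\quad & x_{\ell+1} = \mathrm{LN}\bigl(x_\ell + \alpha\, F_\ell(x_\ell)\bigr), \\
\text{Pre-LN:}\quad  & x_{\ell+1} = x_\ell + \alpha\, F_\ell\bigl(\mathrm{LN}(x_\ell)\bigr).
\end{align*}
% Unrolling the Pre-LN recursion and applying the triangle inequality:
\[
\|x_L\| \;\le\; \|x_0\| + \alpha \sum_{\ell=0}^{L-1} \bigl\|F_\ell\bigl(\mathrm{LN}(x_\ell)\bigr)\bigr\|.
\]
```

Under Post-LN the hidden state is renormalized after every block, whereas under Pre-LN the residual stream accumulates sublayer outputs, so its norm can grow with depth at a rate controlled by $\alpha$. The explicit bounds derived in the paper are not reproduced here.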

Problem

Research questions and friction points this paper is trying to address.

Analyzing forward and backward stability in Transformers with layer normalization

Studying how normalization placement affects training dynamics and gradients

Providing a theoretical framework to evaluate the stability of architectural modifications

Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzed how layer normalization placement affects forward and backward stability

Derived explicit bounds on the growth of hidden states in trained Transformers

Used the analysis to guide residual step scaling, improving stability and performance
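
The interaction between LN placement and residual step scaling can be illustrated with a toy forward pass. The sketch below is not the paper's experiment: it assumes random, untrained weights, replaces each attention or feed-forward sublayer with a hypothetical `tanh` map, uses LN without learned gain or bias, and picks `alpha = 1/sqrt(depth)` purely as an example step size. It only tracks the mean hidden-state norm per layer.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # LN without learned gain/bias: zero mean, unit variance per row.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def run_stack(x, weights, placement, alpha=1.0):
    """Return the mean hidden-state norm after each block.

    tanh(. @ W) is a stand-in for a real sublayer (an assumption for
    illustration); alpha is the residual step size.
    """
    norms = []
    for W in weights:
        if placement == "pre":
            # Pre-LN: normalize the sublayer input, leave the residual stream raw.
            x = x + alpha * np.tanh(layer_norm(x) @ W)
        elif placement == "post":
            # Post-LN: normalize after the residual addition.
            x = layer_norm(x + alpha * np.tanh(x @ W))
        else:
            raise ValueError(f"unknown placement: {placement}")
        norms.append(float(np.linalg.norm(x, axis=-1).mean()))
    return norms

rng = np.random.default_rng(0)
d, depth = 64, 32
weights = [rng.normal(scale=1.0 / np.sqrt(d), size=(d, d)) for _ in range(depth)]
x0 = rng.normal(size=(8, d))

pre = run_stack(x0, weights, "pre")
post = run_stack(x0, weights, "post")
pre_scaled = run_stack(x0, weights, "pre", alpha=1.0 / np.sqrt(depth))

print(f"Pre-LN,  alpha=1          : final norm {pre[-1]:.2f}")
print(f"Pre-LN,  alpha=1/sqrt(L)  : final norm {pre_scaled[-1]:.2f}")
print(f"Post-LN, alpha=1          : final norm {post[-1]:.2f}")
```

In this toy setting Post-LN pins the forward norm near `sqrt(d)` by construction, while the Pre-LN residual stream grows with depth unless the step size is reduced. It says nothing about backward (gradient) stability or trained models, which are the subject of the paper's analysis.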
