🤖 AI Summary
This work addresses the open question of layer normalization (LN) placement in large-scale Transformer training. It systematically analyzes and validates peripheral LN (Peri-LN), a placement strategy that applies LN around each sublayer (at both its input and its output) rather than only before it (Pre-LN) or only after it (Post-LN). The analysis argues that, unlike the dominant Pre-LN and Post-LN configurations, Peri-LN yields milder variance growth across layers, more balanced gradient propagation, and more stable activation distributions. Combining theoretical analysis with large-scale empirical evaluation on models up to 3.2B parameters, the work identifies the mechanisms by which Peri-LN mitigates activation explosion and gradient vanishing, improving training stability and convergence. These results position Peri-LN as a principled third alternative in the LN placement taxonomy, filling a gap between theory and the practice already emerging in open-source models.
📝 Abstract
Designing Transformer architectures with the optimal layer normalization (LN) strategy, one that ensures large-scale training stability and expedites convergence, has remained elusive, even in this era of large language models (LLMs). To this end, we present a comprehensive analytical foundation for understanding how different LN strategies influence training dynamics in large-scale Transformer training. To date, Pre-LN and Post-LN have dominated standard practice despite their limitations in large-scale training. Recently, however, several open-source large-scale models have begun silently adopting a third strategy without much explanation. This strategy places layer normalization (LN) peripherally around sublayers, a design we term Peri-LN. While Peri-LN has demonstrated promising empirical performance, its precise mechanisms and benefits remain almost unexplored. Our in-depth analysis shows that Peri-LN strikes an ideal balance in variance growth, unlike Pre-LN and Post-LN, which are prone to "massive activations" and vanishing gradients, respectively. To validate our theoretical insight, we conduct large-scale experiments on Transformers up to 3.2B parameters, showing that Peri-LN consistently achieves more balanced variance growth, steadier gradient flow, and more stable convergence. Our results suggest that Peri-LN warrants broader consideration for large-scale Transformer architectures, providing renewed insights into the optimal placement and application of LN.
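To make the three placements concrete, here is a minimal NumPy sketch of one residual block under each strategy. The `sublayer` stand-in (a fixed random linear map) and the omission of LN's learnable affine parameters are simplifying assumptions for illustration, not the paper's implementation; the point is only where `layer_norm` sits relative to the residual connection.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize the last dimension to zero mean, unit variance (affine omitted).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def sublayer(x):
    # Toy stand-in for attention or the MLP: a fixed random linear map.
    rng = np.random.default_rng(0)
    W = rng.standard_normal((x.shape[-1], x.shape[-1])) / np.sqrt(x.shape[-1])
    return x @ W

def post_ln_block(x):
    # Post-LN: normalize after the residual addition.
    return layer_norm(x + sublayer(x))

def pre_ln_block(x):
    # Pre-LN: normalize only the sublayer input; the residual path is left
    # unnormalized, so hidden-state variance can grow layer by layer.
    return x + sublayer(layer_norm(x))

def peri_ln_block(x):
    # Peri-LN: normalize both the sublayer's input and its output before the
    # residual addition, bounding each branch's contribution to the stream.
    return x + layer_norm(sublayer(layer_norm(x)))
```

Stacking many `pre_ln_block` calls lets the residual stream's variance accumulate unchecked, whereas each `peri_ln_block` adds a unit-variance increment; that difference is the variance-growth behavior the analysis above formalizes.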