Normalization in Attention Dynamics

📅 2025-10-24
🤖 AI Summary
This work investigates how normalization schemes govern the evolution of token representations in deep Transformers, focusing on representation clustering dynamics and representation collapse. We propose a differential-geometric modeling framework grounded in spherical particle dynamics, formalizing inter-layer representation propagation as an interacting particle system on the unit sphere; this reveals normalization’s role as a “velocity regulator” in attention dynamics. We conduct a unified theoretical analysis of six normalization variants—Post-LN, Pre-LN, Mix-LN, Peri-LN, nGPT, and LN-Scaling—characterizing their distinct impacts on representation structure. Our analysis identifies Peri-LN as optimal in balancing convergence speed and representation diversity, effectively mitigating deep-layer collapse. Empirical evaluation confirms Peri-LN’s superior generalization across language modeling and diverse downstream tasks. The study provides principled, geometry-informed guidance for normalization design in Transformer architectures.

📝 Abstract
We study the effect of normalization schemes on token representations in deep transformers. Modeling their evolution as interacting particles on the sphere, we show that normalization acts as a form of speed regulation. This perspective enables a unified analysis of several schemes -- including Post-LN, Pre-LN, Mix-LN, Peri-LN, nGPT, and LN-Scaling -- revealing how they influence clustering dynamics and representation collapse. Our framework clarifies how different schemes shape token representations across layers and provides a principled basis for comparing them, identifying Peri-LN as a particularly effective choice.
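The abstract's model of tokens as interacting particles on the sphere can be illustrated with a minimal numerical sketch, assuming a simplified single-head attention update followed by re-projection onto the unit sphere (the re-projection plays the role of the "speed regulation" the paper attributes to normalization). The step size, temperature `beta`, and the Peri-LN-style placement of the normalization are illustrative choices, not the paper's exact parameterization:

```python
import numpy as np

def sphere_normalize(x):
    # Project each row (token) back onto the unit sphere,
    # a LayerNorm-like rescaling of the representation.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def attention_step(X, beta=4.0, step=0.1):
    # One simplified attention layer: each token moves toward a
    # softmax-weighted average of all tokens, then is re-normalized.
    logits = beta * X @ X.T                        # pairwise similarities
    A = np.exp(logits - logits.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)              # row-wise softmax
    return sphere_normalize(X + step * A @ X)      # normalize the output

rng = np.random.default_rng(0)
X = sphere_normalize(rng.standard_normal((16, 8)))  # 16 tokens in R^8
init_sim = float((X @ X.T).mean())                  # initial mean cosine similarity

for _ in range(200):                                # 200 "layers"
    X = attention_step(X)

# The attractive dynamics cluster tokens: mean cosine similarity grows
# with depth, the clustering/collapse behavior the paper analyzes.
print(round(init_sim, 3), round(float((X @ X.T).mean()), 3))
```

Running this shows the mean pairwise cosine similarity increasing across layers, i.e. the clustering dynamics that normalization schemes modulate.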
Problem

Research questions and friction points this paper is trying to address.

Analyzing normalization effects on token representations
Modeling attention dynamics as interacting spherical particles
Comparing normalization schemes to prevent representation collapse
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modeling token evolution as interacting particles
Analyzing normalization as speed regulation mechanism
Identifying Peri-LN as optimal normalization scheme
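The normalization variants compared in the paper differ mainly in where LayerNorm sits relative to the residual stream. A minimal sketch of three of the placements, assuming standard (non-affine) LayerNorm and using a toy stand-in `f` for the attention/MLP sublayer rather than the paper's actual blocks:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Standard LayerNorm over the last axis, without learned affine parameters.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln(x, f):
    # Post-LN: normalize after the residual addition.
    return layer_norm(x + f(x))

def pre_ln(x, f):
    # Pre-LN: normalize only the sublayer input; residual stream is unnormalized.
    return x + f(layer_norm(x))

def peri_ln(x, f):
    # Peri-LN: normalize both the input and the output of the sublayer.
    return x + layer_norm(f(layer_norm(x)))

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
f = lambda h: 0.5 * h  # toy sublayer standing in for attention/MLP
print(post_ln(x, f).shape, pre_ln(x, f).shape, peri_ln(x, f).shape)
```

Post-LN renormalizes the whole residual stream each layer, Pre-LN lets residual norms grow unchecked, and Peri-LN bounds the size of each sublayer's contribution while preserving the residual path, which is the balance the analysis credits for mitigating deep-layer collapse.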
Nikita Karagodin
Department of EECS, MIT, Cambridge, MA, USA
Shu Ge
Department of Mathematics, MIT, Cambridge, MA, USA
Yury Polyanskiy
Department of EECS, MIT, Cambridge, MA, USA
Philippe Rigollet
Massachusetts Institute of Technology
Statistics · Machine Learning · Optimal Transport