Clustering in Deep Stochastic Transformers

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a critical limitation in existing theoretical analyses of deep Transformers, which assume deterministic weights and erroneously predict that all tokens collapse to a single point, thereby neglecting the intrinsic noise introduced by random initialization. The authors model the token dynamics in deep randomly initialized Transformers as an interacting particle system on the sphere, driven by a common matrix Brownian noise, and analyze it through diffusion limits combined with RMS normalization. They establish for the first time that initialization noise effectively prevents complete token clustering and uncover a phase transition in two-token systems governed by interaction strength and dimensionality, wherein antipodal configurations emerge as attractive states with positive probability. Both theoretical analysis and numerical experiments confirm that this antipodal clustering persists in multi-token settings and that suppressing initialization noise significantly degrades model accuracy.

📝 Abstract
Transformers have revolutionized deep learning across various domains, but understanding the precise token dynamics remains a theoretical challenge. Existing theories of deep Transformers with layer normalization typically predict that tokens cluster to a single point; however, these results rely on deterministic weight assumptions, which fail to capture the standard initialization scheme in Transformers. In this work, we show that accounting for the intrinsic stochasticity of random initialization alters this picture. More precisely, we analyze deep Transformers where noise arises from the random initialization of value matrices. Under diffusion scaling and token-wise RMS normalization, we prove that, as the number of Transformer layers goes to infinity, the discrete token dynamics converge to an interacting-particle system on the sphere where tokens are driven by a *common* matrix-valued Brownian noise. In this limit, we show that initialization noise prevents the collapse to a single cluster predicted by deterministic models. For two tokens, we prove a phase transition governed by the interaction strength and the token dimension: unlike deterministic attention flows, antipodal configurations become attracting with positive probability. Numerical experiments confirm the predicted transition, reveal that antipodal formations persist for more than two tokens, and demonstrate that suppressing the intrinsic noise degrades accuracy.
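The limiting dynamics described above can be illustrated with a toy Euler scheme: tokens live on the unit sphere, feel an attention-like attraction toward each other, receive a Brownian increment generated by a value matrix *shared by all tokens* (the "common noise"), and are renormalized at each layer. This is a minimal sketch, not the paper's exact SDE: the specific mean-field drift, the parameters `beta` and `sigma`, and the function `simulate` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n_tokens=2, d=16, n_layers=2000, beta=1.0, sigma=1.0):
    """Toy interacting-particle system on the sphere (illustrative only).

    Each step: an attention-like drift toward the token mean, plus a
    matrix-valued Brownian increment W that is COMMON to all tokens
    (mimicking shared random value matrices), then projection back to
    the sphere (playing the role of RMS normalization).
    """
    dt = 1.0 / n_layers
    X = rng.standard_normal((n_tokens, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # start on the sphere
    for _ in range(n_layers):
        # deterministic attraction toward the mean of all tokens
        drift = beta * (X.mean(axis=0, keepdims=True) - X)
        # one Brownian increment per layer, applied to every token
        W = rng.standard_normal((d, d)) * np.sqrt(dt)
        X = X + drift * dt + sigma * (X @ W.T)
        X /= np.linalg.norm(X, axis=1, keepdims=True)  # renormalize
    return X

X = simulate()
cos = float(X[0] @ X[1])  # cosine similarity: +1 = collapsed, -1 = antipodal
```

Depending on the interaction strength `beta`, the noise scale `sigma`, and the dimension `d`, the final cosine similarity can sit near +1 (clustering) or drift toward -1 (antipodal), which is the kind of phase behavior the paper analyzes rigorously for two tokens.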
Problem

Research questions and friction points this paper is trying to address.

clustering
stochasticity
random initialization
token dynamics
deep Transformers
Innovation

Methods, ideas, or system contributions that make the work stand out.

stochastic initialization
interacting-particle system
phase transition
antipodal configuration
diffusion scaling
Authors

Lev Fedorov (New York University)
Michael E. Sander (Google DeepMind) · Machine Learning, Applied Mathematics
R. Élie (Google DeepMind)
Pierre Marion (Inria - Ecole Normale Supérieure) · Machine Learning, Deep Learning
Mathieu Lauriere (New York University)