🤖 AI Summary
This work investigates representation degeneration in deep Transformers at initialization as network depth grows. Modeling the residual stream as a discrete interacting particle system on the unit sphere, and under suitable joint rescalings of the depth, the residual step size, and the number of attention heads, the authors establish, for the first time, a homogenized infinite-depth limit that is, depending on the scaling, either deterministic or stochastic with common noise. In the mean-field regime they derive a stochastic nonlinear Fokker–Planck equation for the conditional law of the representations, and show that under a Gaussian ansatz the limiting drift vanishes, making the homogenized dynamics explicit. The resulting theory quantifies trade-offs among dimension, context length, and temperature, and identifies parameter regimes that mitigate representation collapse into clusters.
📝 Abstract
We study a random model of deep multi-head self-attention in which the weights are resampled independently across layers and heads, as at initialization of training. Viewing depth as a time variable, the residual stream defines a discrete-time interacting particle system on the unit sphere. We prove that, under suitable joint scalings of the depth, the residual step size, and the number of heads, this dynamics admits a nontrivial homogenized limit. Depending on the scaling, the limit is either deterministic or stochastic with common noise; in the mean-field regime, the latter leads to a stochastic nonlinear Fokker–Planck equation for the conditional law of a representative token. In the Gaussian setting, the limiting drift vanishes, making the homogenized dynamics explicit enough to study representation collapse. This yields quantitative trade-offs between dimension, context length, and temperature, and identifies regimes in which clustering can be mitigated.
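The discrete particle system described in the abstract can be sketched numerically: tokens live on the unit sphere, each layer resamples its attention weights independently, and a residual step of size τ (scaled with depth) is applied before projecting back to the sphere. The sketch below is a minimal illustration under assumed conventions; the specific weight scaling (`1/sqrt(d)`), the softmax form, and the inverse-temperature parameter `beta` are assumptions for illustration, not the paper's exact parameterization.

```python
import numpy as np

def random_attention_layer(X, n_heads, tau, beta, rng):
    """One residual multi-head self-attention step with freshly sampled weights.

    X: (n_tokens, d) token representations on the unit sphere.
    tau: residual step size; beta: inverse temperature.
    The 1/sqrt(d) weight scaling and softmax form are illustrative assumptions.
    """
    n, d = X.shape
    update = np.zeros_like(X)
    for _ in range(n_heads):
        # Resample query/key/value weights i.i.d. at every layer and head,
        # as at initialization of training.
        Wq = rng.standard_normal((d, d)) / np.sqrt(d)
        Wk = rng.standard_normal((d, d)) / np.sqrt(d)
        Wv = rng.standard_normal((d, d)) / np.sqrt(d)
        scores = beta * (X @ Wq) @ (X @ Wk).T        # (n, n) attention logits
        scores -= scores.max(axis=1, keepdims=True)  # stabilize the softmax
        A = np.exp(scores)
        A /= A.sum(axis=1, keepdims=True)
        update += A @ (X @ Wv)
    X = X + (tau / n_heads) * update                 # head-averaged residual step
    # Project back onto the unit sphere.
    return X / np.linalg.norm(X, axis=1, keepdims=True)

rng = np.random.default_rng(0)
d, n_tokens, depth = 32, 16, 200
X = rng.standard_normal((n_tokens, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
for _ in range(depth):
    # Step size scaled as 1/depth, in the spirit of the joint depth/step-size scaling.
    X = random_attention_layer(X, n_heads=4, tau=1.0 / depth, beta=1.0, rng=rng)

# Mean pairwise cosine similarity measures how clustered the tokens are:
# values near 1 indicate collapse toward a single cluster.
cos = X @ X.T
mean_sim = (cos.sum() - n_tokens) / (n_tokens * (n_tokens - 1))
```

Tracking `mean_sim` across depths and across choices of dimension, context length, and temperature is one way to probe the collapse regimes the paper characterizes analytically.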