🤖 AI Summary
This work investigates representation degeneration in deep Transformers at initialization as network depth grows. Modeling the residual stream as a discrete interacting particle system on the unit sphere, and under suitable joint rescalings of the depth, the residual step size, and the number of attention heads, the authors establish, for the first time, a homogenized infinite-depth limit that is, depending on the scaling, either deterministic or stochastic with common noise. In the mean-field regime they derive a stochastic nonlinear Fokker–Planck equation for the conditional law of the representations, and show that under a Gaussian ansatz the limiting drift vanishes, making the homogenized dynamics explicit. The resulting theory quantifies trade-offs among dimension, context length, and temperature, and identifies parameter regimes that mitigate representation collapse into clusters.
📝 Abstract
We study a random model of deep multi-head self-attention in which the weights are resampled independently across layers and heads, as at initialization of training. Viewing depth as a time variable, the residual stream defines a discrete-time interacting particle system on the unit sphere. We prove that, under suitable joint scalings of the depth, the residual step size, and the number of heads, this dynamics admits a nontrivial homogenized limit. Depending on the scaling, the limit is either deterministic or stochastic with common noise; in the mean-field regime, the latter leads to a stochastic nonlinear Fokker–Planck equation for the conditional law of a representative token. In the Gaussian setting, the limiting drift vanishes, making the homogenized dynamics explicit enough to study representation collapse. This yields quantitative trade-offs between dimension, context length, and temperature, and identifies regimes in which clustering can be mitigated.
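The discrete particle system described in the abstract can be sketched numerically: tokens live on the unit sphere, each layer resamples its attention weights independently, and a residual step of size τ (scaled with depth) is applied before projecting back to the sphere. The sketch below is a minimal illustration under assumed conventions; the specific weight scaling (`1/sqrt(d)`), the softmax form, and the inverse-temperature parameter `beta` are assumptions for illustration, not the paper's exact parameterization.

```python
import numpy as np

def random_attention_layer(X, n_heads, tau, beta, rng):
    """One residual multi-head self-attention step with freshly sampled weights.

    X: (n_tokens, d) token representations on the unit sphere.
    tau: residual step size; beta: inverse temperature.
    The 1/sqrt(d) weight scaling and softmax form are illustrative assumptions.
    """
    n, d = X.shape
    update = np.zeros_like(X)
    for _ in range(n_heads):
        # Resample query/key/value weights i.i.d. at every layer and head,
        # as at initialization of training.
        Wq = rng.standard_normal((d, d)) / np.sqrt(d)
        Wk = rng.standard_normal((d, d)) / np.sqrt(d)
        Wv = rng.standard_normal((d, d)) / np.sqrt(d)
        scores = beta * (X @ Wq) @ (X @ Wk).T        # (n, n) attention logits
        scores -= scores.max(axis=1, keepdims=True)  # stabilize the softmax
        A = np.exp(scores)
        A /= A.sum(axis=1, keepdims=True)
        update += A @ (X @ Wv)
    X = X + (tau / n_heads) * update                 # head-averaged residual step
    # Project back onto the unit sphere.
    return X / np.linalg.norm(X, axis=1, keepdims=True)

rng = np.random.default_rng(0)
d, n_tokens, depth = 32, 16, 200
X = rng.standard_normal((n_tokens, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
for _ in range(depth):
    # Step size scaled as 1/depth, in the spirit of the joint depth/step-size scaling.
    X = random_attention_layer(X, n_heads=4, tau=1.0 / depth, beta=1.0, rng=rng)

# Mean pairwise cosine similarity measures how clustered the tokens are:
# values near 1 indicate collapse toward a single cluster.
cos = X @ X.T
mean_sim = (cos.sum() - n_tokens) / (n_tokens * (n_tokens - 1))
```

Tracking `mean_sim` across depths and across choices of dimension, context length, and temperature is one way to probe the collapse regimes the paper characterizes analytically.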