🤖 AI Summary
This work studies the continuous-time limit and the synchronization mechanisms of token evolution in finite-depth, finite-width Transformers with MLP blocks. The layerwise token dynamics are shown to converge pathwise to a continuous-time stochastic interacting particle system, and the stochastic partial differential equation governing the evolution of the token distribution in this limit is identified. The study proves, with quantitative bounds, propagation of chaos when the number of tokens is large, and shows that the limits considered commute. It further shows that the limiting model synchronizes by noise: when the common noise is sufficiently coercive relative to the deterministic self-attention drift, the system’s mean interaction energy decays exponentially. Finally, the activation functions satisfying this coercivity condition are characterized.
📝 Abstract
We prove pathwise convergence of the layerwise evolution of tokens in a finite-depth, finite-width transformer model with multilayer perceptron (MLP) blocks to a continuous-time stochastic interacting particle system. We also identify the stochastic partial differential equation describing the evolution of the tokens' distribution in this limit and prove propagation of chaos when the number of tokens is large. The bounds we establish are quantitative, and the limits we consider commute. We further prove that the limiting stochastic model displays synchronization by noise and establish exponential dissipation of the interaction energy on average, provided that the common noise is sufficiently coercive relative to the deterministic self-attention drift. Finally, we characterize the activation functions satisfying this coercivity condition.
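To make the objects in the abstract concrete, the following is a minimal toy sketch in Python, not the paper's model. It assumes a hypothetical layer update in which each token moves by a softmax-weighted attention drift (step size `h`) plus a random MLP-like term whose weight matrix is shared by all tokens in a layer, which is one way a "common noise" could arise. The drift form, the `1/sqrt(d)` scaling, the `tanh` activation, and all function names and parameters are illustrative assumptions.

```python
import numpy as np

def attention_drift(x, beta=1.0):
    """Softmax-weighted average of all tokens minus the token itself (assumed form)."""
    scores = beta * x @ x.T
    scores -= scores.max(axis=1, keepdims=True)  # stabilise the softmax numerically
    w = np.exp(scores)
    w /= w.sum(axis=1, keepdims=True)
    return w @ x - x

def layer_step(x, h, sigma, rng, beta=1.0, act=np.tanh):
    """One hypothetical layer: an Euler-Maruyama-style step of size h."""
    n, d = x.shape
    # A single random matrix is shared by every token in this layer,
    # so the noise is common to all tokens (an assumption of this sketch).
    W = rng.standard_normal((d, d)) / np.sqrt(d)
    noise = act(x) @ W.T
    return x + h * attention_drift(x, beta) + sigma * np.sqrt(h) * noise

def interaction_energy(x):
    """Half the mean squared pairwise distance between tokens (toy definition)."""
    diff = x[:, None, :] - x[None, :, :]
    return 0.5 * np.mean(np.sum(diff ** 2, axis=-1))

def diameter(x):
    """Largest pairwise distance between tokens."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1)).max()

def simulate(n=16, d=4, depth=200, h=0.05, sigma=0.5, seed=0):
    """Run `depth` layers and record the interaction energy after each one."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, d))
    energies = [interaction_energy(x)]
    for _ in range(depth):
        x = layer_step(x, h, sigma, rng)
        energies.append(interaction_energy(x))
    return x, np.array(energies)

if __name__ == "__main__":
    _, energies = simulate()
    print(energies[0], energies[-1])
```

Running `simulate` with different values of `sigma`, `h`, and `depth` gives a rough way to watch how the toy interaction energy evolves across layers. Whether it decays in a given run depends on the assumed drift and noise, so the output should not be read as a check of the paper's theorems.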