🤖 AI Summary
This work investigates the token dynamics of encoder-only Transformers at inference time in the moderate interaction regime, where the number of tokens $N$ is large and the inverse temperature parameter $\beta$ scales together with $N$. Modeling tokens as a mean-field interacting particle system, the authors identify a three-stage, multiscale evolution through depth: a fast phase in which the token empirical measure collapses onto a low-dimensional space, an intermediate phase in which it collapses further into clusters, and a slow phase in which the clusters merge sequentially into a single one. They give a rigorous characterization of the limiting dynamics in each phase and prove convergence in this joint limit. The results are illustrated with numerical simulations. Together, this gives a multiscale description of how token representations evolve through depth, with rigorous convergence guarantees.
📝 Abstract
In this paper, we study the evolution of tokens through the depth of encoder-only transformer models at inference time by modeling them as a system of particles interacting in a mean-field way and studying the corresponding dynamics. More specifically, we consider this problem in the moderate interaction regime, where the number $N$ of tokens is large and the inverse temperature parameter $\beta$ of the model scales together with $N$. In this regime, the dynamics of the system displays multiscale behavior: a fast phase, where the token empirical measure collapses onto a low-dimensional space; an intermediate phase, where the measure further collapses into clusters; and a slow phase, where these clusters sequentially merge into a single one. We provide a rigorous characterization of the limiting dynamics in each of these phases and prove convergence in the above-mentioned limit, illustrating our results with simulations.
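To make the particle picture concrete, here is a minimal simulation sketch. It is not taken from the paper. It assumes the softmax self-attention dynamics on the unit sphere that is commonly used in the mean-field transformer literature, with identity query, key and value matrices and an explicit Euler discretization in depth. The function names (`attention_step`, `simulate`) and all parameter values are hypothetical, and `beta` is left as a free parameter rather than tied to $N$ by any particular scaling law.

```python
import numpy as np


def attention_step(x, beta, dt):
    """One explicit Euler step of softmax self-attention dynamics on the unit sphere.

    x has shape (n_tokens, dim) with unit-norm rows. The model (identity
    query/key/value matrices) is an assumption made for illustration.
    """
    logits = beta * (x @ x.T)
    logits -= logits.max(axis=1, keepdims=True)  # stabilize the softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    drift = weights @ x  # attention-weighted average of the tokens
    # Project the drift onto the tangent space of the sphere at each token.
    drift -= np.sum(drift * x, axis=1, keepdims=True) * x
    x_new = x + dt * drift
    # Renormalize so tokens stay on the sphere after the Euler step.
    return x_new / np.linalg.norm(x_new, axis=1, keepdims=True)


def simulate(n_tokens=64, dim=3, beta=4.0, dt=0.1, n_steps=2000, seed=0):
    """Evolve random initial tokens and record the minimum pairwise inner product."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_tokens, dim))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    min_similarity = np.empty(n_steps + 1)
    for step in range(n_steps):
        min_similarity[step] = (x @ x.T).min()
        x = attention_step(x, beta, dt)
    min_similarity[n_steps] = (x @ x.T).min()
    return x, min_similarity


if __name__ == "__main__":
    tokens, min_similarity = simulate()
    print("min pairwise inner product, start vs end:",
          min_similarity[0], min_similarity[-1])
```

Tracking the minimum pairwise inner product over depth is one simple way to watch for clustering: a value near 1 means all tokens sit close together. Varying `beta` and `n_tokens` jointly is how one would probe the regime the abstract describes, although the specific scaling studied in the paper is not reproduced here.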