Emergence of meta-stable clustering in mean-field transformer models

📅 2024-10-30
🏛️ arXiv.org
📈 Citations: 4
Influential: 0
🤖 AI Summary
This work studies the long-time behavior of token dynamics on the unit sphere in deep Transformers, focusing on the emergence and persistence of meta-stable phases and clustering when the number of tokens is large. Building on the framework of Geshkovski et al. (2023), the authors model token evolution as a mean-field interacting particle system on the sphere and study the associated mean-field PDE, which can be interpreted as a Wasserstein gradient flow. A perturbative analysis around the i.i.d. uniform initialization shows that, in the limit of a large number of tokens, the dynamics remain close to a meta-stable manifold of solutions with a given structure (e.g., periodicity). That structure is identified explicitly, as a function of the inverse temperature parameter, by the index maximizing a certain rescaling of Gegenbauer polynomials.

📝 Abstract
We model the evolution of tokens within a deep stack of Transformer layers as a continuous-time flow on the unit sphere, governed by a mean-field interacting particle system, building on the framework introduced in (Geshkovski et al., 2023). Studying the corresponding mean-field Partial Differential Equation (PDE), which can be interpreted as a Wasserstein gradient flow, in this paper we provide a mathematical investigation of the long-term behavior of this system, with a particular focus on the emergence and persistence of meta-stable phases and clustering phenomena, key elements in applications like next-token prediction. More specifically, we perform a perturbative analysis of the mean-field PDE around the iid uniform initialization and prove that, in the limit of large number of tokens, the model remains close to a meta-stable manifold of solutions with a given structure (e.g., periodicity). Further, the structure characterizing the meta-stable manifold is explicitly identified, as a function of the inverse temperature parameter of the model, by the index maximizing a certain rescaling of Gegenbauer polynomials.
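
The particle system described in the abstract can be sketched numerically. The snippet below is a minimal illustration, not the authors' code: it assumes the unnormalized attention dynamics from the Geshkovski et al. line of work, in which each token x_i moves along the tangential projection of (1/n) Σ_j exp(β⟨x_i, x_j⟩) x_j, and it uses a plain explicit Euler step followed by renormalization. The function names (`init_tokens`, `euler_step`, `simulate`, `interaction_energy`) and all parameter values are hypothetical choices made for this example.

```python
import numpy as np


def init_tokens(n, d, seed=0):
    """Draw n i.i.d. uniform points on the unit sphere in R^d."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    return x


def interaction_energy(x, beta):
    """Average of exp(beta * <x_i, x_j>) over all pairs (assumed energy)."""
    return float(np.mean(np.exp(beta * (x @ x.T))))


def euler_step(x, beta, dt):
    """One explicit Euler step of the assumed dynamics, then renormalize."""
    n = x.shape[0]
    weights = np.exp(beta * (x @ x.T))       # attention-like weights, shape (n, n)
    drift = weights @ x / n                   # (1/n) * sum_j w_ij * x_j
    # Project the drift onto the tangent space of the sphere at each x_i.
    drift -= np.sum(drift * x, axis=1, keepdims=True) * x
    x_new = x + dt * drift
    # Retract back to the sphere (Euler steps drift slightly off it).
    return x_new / np.linalg.norm(x_new, axis=1, keepdims=True)


def simulate(x0, beta, dt, steps):
    """Run the discretized flow for a fixed number of steps."""
    x = x0
    for _ in range(steps):
        x = euler_step(x, beta, dt)
    return x


if __name__ == "__main__":
    beta = 2.0
    x0 = init_tokens(n=64, d=3, seed=0)
    x1 = simulate(x0, beta=beta, dt=0.01, steps=500)
    print("energy before:", interaction_energy(x0, beta))
    print("energy after: ", interaction_energy(x1, beta))
```

With a small step size, the interaction energy should grow along the trajectory, which is the discrete counterpart of the gradient-flow interpretation. Tracking pairwise inner products over time is one way to observe tokens grouping into clusters.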
Problem

Research questions and friction points this paper is trying to address.

Model token evolution in Transformers as continuous-time flow
Study long-term behavior and meta-stable clustering of the mean-field PDE
Identify meta-stable manifold structure via Gegenbauer polynomials
Innovation

Methods, ideas, or system contributions that make the work stand out.

Continuous-time flow modeling on unit sphere
Mean-field PDE as Wasserstein gradient flow
Perturbative analysis of meta-stable manifold
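
To give a feel for how a perturbative analysis around the uniform state can single out a dominant mode, here is a hedged sketch restricted to the circle (d = 2), where the relevant harmonics are plain Fourier modes cos(kθ). It is not the paper's formula: the exact rescaling of Gegenbauer polynomials used there is not reproduced here. Under the assumption of the unnormalized kernel exp(β cos θ), linearizing the mean-field PDE around the uniform density gives mode k a growth rate proportional to k² c_k(β)/β, where c_k(β) is the k-th Fourier cosine coefficient of the kernel. The helper names below are hypothetical.

```python
import numpy as np


def fourier_coefficient(k, beta, grid_size=4096):
    """k-th Fourier cosine coefficient of exp(beta * (cos(theta) - 1)).

    The factor exp(-beta) only rescales every coefficient by the same
    constant, so it does not change which mode is dominant; it is there
    to avoid overflow for large beta.
    """
    theta = 2.0 * np.pi * np.arange(grid_size) / grid_size
    kernel = np.exp(beta * (np.cos(theta) - 1.0))
    # Mean over a uniform periodic grid approximates (1/2pi) * integral.
    return float(np.mean(kernel * np.cos(k * theta)))


def growth_rate(k, beta):
    """Assumed linearized growth rate of mode k (up to a common constant)."""
    return k**2 * fourier_coefficient(k, beta) / beta


def dominant_mode(beta, k_max=50):
    """Index k in 1..k_max with the largest assumed growth rate."""
    rates = [growth_rate(k, beta) for k in range(1, k_max + 1)]
    return 1 + int(np.argmax(rates))


if __name__ == "__main__":
    for beta in (0.5, 5.0, 50.0):
        print(f"beta = {beta:5.1f} -> dominant mode k = {dominant_mode(beta)}")
```

In this toy setting the dominant index grows with β, which mirrors the summary's statement that the structure of the meta-stable manifold depends on the inverse temperature parameter.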