AI Summary
This work uncovers the staged learning dynamics of attention mechanisms during Transformer training and formally introduces, for the first time, a cyclic "focus-dilution" alternation pattern. Focusing on single-layer Transformers trained on Markovian data, the study combines gradient flow analysis, stage-wise linearization, and perturbation theory around critical points to systematically characterize the mathematical structure and evolution of each learning phase. Experiments on synthetic Markov sequences, WikiText, and TinyStories consistently validate the ubiquity of this multi-stage cyclic behavior, with empirical observations aligning closely with theoretical predictions. These findings offer a novel perspective on the mechanistic underpinnings of how attention mechanisms learn throughout training.
Abstract
Transformer-based models have achieved remarkable success across a wide range of domains, yet our understanding of their training dynamics remains limited. In this work, we identify a recurrent focus-dilution cycle in attention learning and provide a rigorous explanation in a one-layer Transformer setting for Markovian data via gradient-flow analysis. Using stage-wise linearization around critical points, we show that a single focus-dilution cycle can be decomposed into a sequence of distinct stages. First, embedding and projection rapidly condense to a rank-one structure, while attention parameters remain effectively frozen. Then, the attention parameters begin to increase, inducing a frequency-driven focus toward high-frequency tokens. As attention continues to evolve, it generates next-order perturbations in embeddings, leading to a mass-redistribution mechanism that progressively dilutes this focus. Finally, small asymmetries among low-frequency tokens lift a degenerate critical point, opening new embedding directions and initiating the next cycle. Experiments on synthetic Markovian data as well as WikiText and TinyStories corroborate the predicted stages and cyclical dynamics.
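To make the notions of "high-frequency tokens" and attention "focus" concrete, the following is a minimal sketch under stated assumptions, not the authors' experimental setup. It samples a synthetic Markov sequence, compares empirical token frequencies with the chain's stationary distribution, and uses the entropy of an attention row as one possible way to quantify focus versus dilution. The transition matrix `P`, the function names, and the entropy measure are all hypothetical choices made for illustration.

```python
import numpy as np


def sample_markov_chain(P, length, rng):
    """Sample a token sequence from a row-stochastic transition matrix P."""
    n = P.shape[0]
    seq = np.empty(length, dtype=np.int64)
    seq[0] = rng.integers(n)  # uniform random start token
    for t in range(1, length):
        seq[t] = rng.choice(n, p=P[seq[t - 1]])
    return seq


def stationary_distribution(P):
    """Left eigenvector of P for the eigenvalue closest to 1, normalized to sum to 1."""
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()


def attention_entropy(weights):
    """Shannon entropy of one attention row; lower entropy means sharper focus."""
    w = np.clip(np.asarray(weights, dtype=float), 1e-12, 1.0)
    return float(-(w * np.log(w)).sum())


rng = np.random.default_rng(0)

# Hypothetical 4-token chain in which token 0 is visited far more often
# than the others (an assumption for illustration, not taken from the paper).
P = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.6, 0.2, 0.1, 0.1],
    [0.6, 0.1, 0.2, 0.1],
    [0.6, 0.1, 0.1, 0.2],
])

seq = sample_markov_chain(P, 20000, rng)
empirical = np.bincount(seq, minlength=P.shape[0]) / len(seq)
pi = stationary_distribution(P)

# Illustrative attention rows: one concentrated on token 0, one spread evenly.
focused = np.array([0.91, 0.03, 0.03, 0.03])
diluted = np.full(4, 0.25)

print("empirical frequencies:", np.round(empirical, 3))
print("stationary distribution:", np.round(pi, 3))
print("entropy (focused):", attention_entropy(focused))
print("entropy (diluted):", attention_entropy(diluted))
```

In a setup like this, tracking the entropy of the learned attention rows over training steps would be one way to look for alternating drops (focus) and rises (dilution); whether this matches the diagnostics used in the paper is not stated in the abstract.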