Focus and Dilution: The Multi-stage Learning Process of Attention

📅 2026-05-01

🤖 AI Summary

This work identifies staged learning dynamics in attention mechanisms during Transformer training and describes a recurring focus-dilution cycle. For single-layer Transformers trained on Markovian data, the study combines gradient-flow analysis, stage-wise linearization, and perturbation theory around critical points to characterize the mathematical structure and evolution of each learning phase. Experiments on synthetic Markov sequences, WikiText, and TinyStories corroborate the predicted stages and the cyclic behavior. The findings offer a mechanistic perspective on how attention is learned over the course of training.
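
For readers unfamiliar with the tools named above, the following is a generic textbook sketch of gradient flow and its linearization around a critical point. It is not taken from the paper: the symbols $L$ (training loss), $\theta$ (all parameters), $\theta^*$ (a critical point), $\delta$ (a small perturbation), and $H$ (the Hessian) are assumptions introduced here for illustration only.

```latex
% Gradient flow: the continuous-time limit of gradient descent
\dot{\theta}(t) = -\nabla_\theta L(\theta(t))

% A critical point satisfies
\nabla_\theta L(\theta^*) = 0

% Linearization: write \theta(t) = \theta^* + \delta(t) and keep first-order terms
\dot{\delta}(t) \approx -H(\theta^*)\,\delta(t),
\qquad H(\theta^*) = \nabla_\theta^2 L(\theta^*)
```

Under this linearization, perturbations along eigenvectors of $H(\theta^*)$ with negative eigenvalues grow exponentially, while directions with zero eigenvalue are degenerate at first order and require higher-order perturbation terms to resolve.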

📝 Abstract

Transformer-based models have achieved remarkable success across a wide range of domains, yet our understanding of their training dynamics remains limited. In this work, we identify a recurrent focus-dilution cycle in attention learning and provide a rigorous explanation in a one-layer Transformer setting for Markovian data via gradient-flow analysis. Using stage-wise linearization around critical points, we show that a single focus-dilution cycle can be decomposed into a sequence of distinct stages. First, embedding and projection rapidly condense to a rank-one structure, while attention parameters remain effectively frozen. Then, the attention parameters begin to increase, inducing a frequency-driven focus toward high-frequency tokens. As attention continues to evolve, it generates next-order perturbations in embeddings, leading to a mass-redistribution mechanism that progressively dilutes this focus. Finally, small asymmetries among low-frequency tokens lift a degenerate critical point, opening new embedding directions and initiating the next cycle. Experiments on synthetic Markovian data as well as WikiText and TinyStories corroborate the predicted stages and cyclical dynamics.
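
To make the setting concrete, here is a minimal Python sketch of synthetic Markovian data and a single attention head. It is an illustration under stated assumptions, not the paper's experimental setup: the transition matrix `P`, the embedding matrix `E`, the attention matrix `A`, and all function names are hypothetical. It shows two things mentioned in the abstract: token frequencies in a Markov sequence follow the chain's stationary distribution, and an attention head whose parameters are still zero attends uniformly.

```python
import numpy as np

def sample_markov_chain(P, length, rng):
    """Sample a token sequence from a first-order Markov chain with transition matrix P."""
    n = P.shape[0]
    seq = np.empty(length, dtype=np.int64)
    seq[0] = rng.integers(n)
    for t in range(1, length):
        seq[t] = rng.choice(n, p=P[seq[t - 1]])
    return seq

def stationary_distribution(P):
    """Left eigenvector of P for the eigenvalue closest to 1, normalized to sum to 1."""
    vals, vecs = np.linalg.eig(P.T)
    idx = np.argmin(np.abs(vals - 1.0))
    pi = np.real(vecs[:, idx])
    return pi / pi.sum()

def attention_weights(seq, E, A):
    """Softmax attention of the last token over all positions, with scores e_j^T A e_last."""
    X = E[seq]                      # (T, d) token embeddings
    scores = X @ A @ X[-1]          # (T,) bilinear attention scores
    scores = scores - scores.max()  # subtract the max for numerical stability
    w = np.exp(scores)
    return w / w.sum()

rng = np.random.default_rng(0)

# Hypothetical 3-token chain; token 0 is the high-frequency token.
P = np.array([[0.6, 0.3, 0.1],
              [0.5, 0.3, 0.2],
              [0.4, 0.4, 0.2]])

seq = sample_markov_chain(P, 5000, rng)
pi = stationary_distribution(P)
empirical = np.bincount(seq, minlength=3) / len(seq)

# With attention parameters at zero, every position gets the same weight.
d = 4
E = rng.normal(size=(3, d))
A_zero = np.zeros((d, d))
w_uniform = attention_weights(seq[:10], E, A_zero)

print("stationary:", np.round(pi, 3))
print("empirical: ", np.round(empirical, 3))
print("attention with A = 0:", np.round(w_uniform, 3))
```

Once `A` moves away from zero, the weights are no longer uniform. Which tokens they concentrate on depends on the learned embeddings, which this sketch does not model.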

Problem

Research questions and friction points this paper is trying to address.

attention dynamics

Transformer training

focus-dilution cycle

gradient flow

Markovian data

Innovation

Methods, ideas, or system contributions that make the work stand out.

focus-dilution cycle

attention dynamics

gradient-flow analysis