Self-Attention as a Covariance Readout: A Unified View of In-Context Learning and Repetition

📅 2026-05-11

🤖 AI Summary
This work proposes a unified mechanism underlying in-context learning and repetitive generation in large language models. Assuming the inputs are stationary, ergodic, and elliptically distributed, the authors prove that the softmax self-attention output converges almost surely, in the long-context limit, to a linear readout of the input covariance matrix. Using limit theory for stochastic processes and a model of Transformer dynamics, the study argues that both phenomena stem from self-attention's extraction of second-order input statistics: for in-context linear regression, a single softmax head can implement one step of population gradient descent, and stacking such heads with residual connections implements multiple steps. Furthermore, propagating the readout through an $L$-layer transformer drives the terminal hidden state toward a deterministic function of the current token alone, so autoregressive generation collapses asymptotically to a first-order Markov chain, which gives a structural account of repetition and mode collapse.
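The convergence claim can be checked numerically in a special case. The sketch below is an illustration and is not taken from the paper: it assumes i.i.d. zero-mean Gaussian context tokens (one simple stationary, ergodic, elliptical input), unscaled softmax logits, and the limit $\Theta_V \Sigma \Theta_K^{\top} \Theta_Q x_t$ given in the abstract. The function name, dimensions, and random matrices are hypothetical. In the Gaussian case the limit follows from exponential tilting, since $E[x e^{a^{\top}x}] / E[e^{a^{\top}x}] = \Sigma a$ for $x \sim N(0, \Sigma)$.

```python
import numpy as np

def softmax_attention_readout(n_context=200_000, d=4, seed=0):
    """Compare softmax attention over a long Gaussian context with the
    covariance readout Theta_V Sigma Theta_K^T Theta_Q x_t.
    All matrices are hypothetical, drawn only for illustration."""
    rng = np.random.default_rng(seed)

    # Hypothetical input covariance, positive definite by construction.
    b = rng.normal(size=(d, d))
    sigma = b @ b.T / d + 0.5 * np.eye(d)

    # Hypothetical query, key, and value projections.
    theta_q, theta_k, theta_v = (
        rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3)
    )

    # Query token, rescaled so the attention logits have variance 0.5.
    # This only keeps the Monte Carlo estimate well behaved.
    x_t = rng.normal(size=d)
    a = theta_k.T @ theta_q @ x_t
    x_t = x_t * np.sqrt(0.5 / (a @ sigma @ a))

    # Long context of i.i.d. N(0, Sigma) tokens (assumption).
    context = rng.multivariate_normal(np.zeros(d), sigma, size=n_context)

    # Softmax attention: logits are <Theta_K x_s, Theta_Q x_t>.
    logits = (context @ theta_k.T) @ (theta_q @ x_t)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    empirical = theta_v @ (weights @ context)

    # Claimed long-context limit.
    limit = theta_v @ sigma @ theta_k.T @ theta_q @ x_t
    return empirical, limit

empirical, limit = softmax_attention_readout()
print("attention output:", empirical)
print("covariance readout:", limit)
print("gap:", np.linalg.norm(empirical - limit))
```

The gap should shrink as `n_context` grows. Correlated (non-i.i.d.) or non-Gaussian elliptical inputs are covered by the paper's assumptions but are not exercised by this sketch.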
📝 Abstract
Large language models (LLMs) exhibit two striking and ostensibly unrelated behaviours: in-context learning (ICL) and repetitive generation. In both, the model behaves as though it had summarised the context into a population-level statistic and discarded token-level detail. We ask whether this "summarisation and forgetting" can be derived from the attention mechanism itself, and answer in the affirmative. Under stationary, ergodic and elliptical inputs, the softmax attention output converges almost surely to $\Theta_V \Sigma \Theta_K^{\top} \Theta_Q x_t$, where $\Sigma$ is the input covariance; the long-context limit is therefore a linear readout of the input's second-order statistics. Two consequences follow. (i) For in-context linear regression, a single softmax head can implement one step of population gradient descent. Stacking such heads with residual connections iterates this update and implements multiple gradient descent steps. (ii) Propagated across an $L$-layer transformer, this readout drives the terminal hidden state at the parametric $1/t$ rate to a deterministic function of the current token alone, so that autoregressive generation collapses asymptotically to a first-order Markov chain whose attracting orbits furnish a structural account of repetition and mode collapse. The two phenomena thus emerge as facets of a single covariance-readout principle.
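Consequence (i) can be made concrete with a small linear-algebra sketch. The construction below is one illustrative choice and is not necessarily the authors' own: tokens are assumed to be $z = (x, y)$ with noiseless labels $y = w_*^{\top} x$, gradient descent starts from $w_0 = 0$ with a hypothetical step size `eta`, and the head's matrices are chosen by hand so that the readout $\Theta_V \Sigma_z \Theta_K^{\top} \Theta_Q z_q$ places the one-step prediction in the label slot of the output.

```python
import numpy as np

def one_step_gd_vs_readout(d=3, eta=0.1, seed=0):
    """Compare one population gradient descent step for linear regression
    with a hand-built covariance readout. All names are hypothetical."""
    rng = np.random.default_rng(seed)

    # Hypothetical feature covariance and ground-truth weights.
    b = rng.normal(size=(d, d))
    sigma_x = b @ b.T / d + 0.5 * np.eye(d)
    w_star = rng.normal(size=d)

    # Joint covariance of z = (x, y) with y = w_star^T x (noiseless).
    sigma_xy = sigma_x @ w_star
    sigma_z = np.block([
        [sigma_x, sigma_xy[:, None]],
        [sigma_xy[None, :], np.array([[w_star @ sigma_xy]])],
    ])

    # One step on L(w) = 0.5 * E[(y - w^T x)^2]; the population gradient
    # is Sigma_x w - Sigma_xy.
    w0 = np.zeros(d)
    w1 = w0 - eta * (sigma_x @ w0 - sigma_xy)
    x_q = rng.normal(size=d)
    gd_prediction = w1 @ x_q

    # Query token with an empty label slot.
    z_q = np.append(x_q, 0.0)

    # Hand-picked head: keys and queries see only the features,
    # values carry only the label, scaled by eta.
    theta_q = np.eye(d + 1)
    theta_k = np.eye(d + 1)
    theta_k[d, d] = 0.0
    theta_v = np.zeros((d + 1, d + 1))
    theta_v[d, d] = eta

    readout = theta_v @ sigma_z @ theta_k.T @ theta_q @ z_q
    return gd_prediction, readout[d]

gd_prediction, readout_prediction = one_step_gd_vs_readout()
print(gd_prediction, readout_prediction)
```

Both values equal $\eta\, w_*^{\top} \Sigma_x x_q$, so they agree up to floating-point error. Adding the head's output back onto the token through a residual connection and repeating is the sense in which stacked heads would iterate the update.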
Problem

Research questions and friction points this paper is trying to address.

in-context learning
repetitive generation
self-attention
covariance readout
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-attention
covariance readout
in-context learning
repetition
transformer dynamics