How Does Attention Help? Insights from Random Matrices on Signal Recovery from Sequence Models

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work investigates how attention mechanisms enhance the ability of sequence models to recover latent signals from noisy observations in the high-dimensional limit. Using random matrix theory, the authors analyze the spectrum of the sample covariance matrix built from attention-pooled sequence representations, characterizing its limiting eigenvalue distribution, its outlier eigenvalues, and the alignment between eigenvectors and the true signal. The analysis reveals two successive BBP-type phase transitions governing signal recovery and shows that the attention weights maximizing the signal-to-noise ratio are given by the normalized leading eigenvector of the positional correlation matrix. Moreover, parameter-free causal self-attention with $\tau/d$ score scaling is shown to yield deterministic harmonic weights that improve on mean pooling whenever early tokens carry more signal. The framework combines free multiplicative convolution, high-dimensional spectral analysis, and a two-class Gaussian mixture embedding model, and its predictions agree closely with finite-dimensional simulations, quantifying how attention raises the signal-to-noise ratio and improves recovery.

📝 Abstract

We study the spectral properties of sample covariance matrices constructed from pooled sequence representations, where token embeddings are drawn from a fixed two-class Gaussian mixture table and pooled via (fixed) attention weights. Working in the high-dimensional regime $d,V,N\to\infty$ with $d/V\to\delta$ and $d/N\to\gamma$, we derive exact characterizations of the limiting eigenvalue distribution, outlier eigenvalues, and eigenvector alignment with the hidden signal. The bulk spectrum follows a non-Marchenko--Pastur law given by the free multiplicative convolution $\kappa(MP_\delta\boxtimes MP_\gamma)$, reflecting the finite vocabulary structure. Signal recovery undergoes two successive BBP-type phase transitions governed by four scalars: $\delta$, $\gamma$, $\alpha=w^{\top} R w$, and $\kappa=\|w\|^2$, where $w$ denotes the attention pooling weights and $R$ the positional correlation matrix. As a by-product of our analysis, we show that the optimal attention weights maximizing the signal-to-noise ratio $\alpha/\kappa$ are given by the (normalized) top eigenvector of $R$, and (as a particular case of our analysis) that parameter-free causal self-attention with $\tau/d$ score scaling yields deterministic harmonic weights that improve signal recovery over mean pooling whenever early tokens carry more signal. Extensive simulations confirm sharp agreement between theory and finite-dimensional experiments.
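
The claim that the top eigenvector of $R$ maximizes $\alpha/\kappa$ follows from the fact that $\alpha/\kappa = w^{\top} R w / \|w\|^2$ is a Rayleigh quotient. The NumPy sketch below illustrates this on a small example. Everything specific in it is an assumption made for illustration and is not taken from the paper: the sequence length `T`, the toy matrix `R` (per-position signal strengths that decay geometrically, with correlations that decay with distance), and in particular the formula used in `harmonic_weights`, which is one plausible reading of "harmonic weights" (uniform attention over the causal prefix followed by mean pooling of the outputs).

```python
import numpy as np


def snr(w, R):
    """Rayleigh quotient alpha / kappa = (w^T R w) / ||w||^2."""
    w = np.asarray(w, dtype=float)
    return float(w @ R @ w) / float(w @ w)


def harmonic_weights(T):
    """ASSUMED form of the harmonic weights: w_j = (1/T) * sum_{t=j}^{T} 1/t.

    This is what results if position t attends uniformly to positions 1..t
    and the T attention outputs are then averaged. The paper's exact
    definition may differ.
    """
    inv = 1.0 / np.arange(1, T + 1)
    # Reverse cumulative sum gives sum_{t=j}^{T} 1/t for each position j.
    return np.cumsum(inv[::-1])[::-1] / T


# Hypothetical positional correlation matrix (illustration only):
# signal strength a_j halves with each position, correlation decays as 0.5^|i-j|.
T = 8
idx = np.arange(T)
a = 0.5 ** idx
R = np.outer(a, a) * 0.5 ** np.abs(idx[:, None] - idx[None, :])

# eigh returns eigenvalues in ascending order, so the last column is the top eigenvector.
eigvals, eigvecs = np.linalg.eigh(R)
w_opt = eigvecs[:, -1]
w_mean = np.full(T, 1.0 / T)
w_harm = harmonic_weights(T)

snr_opt = snr(w_opt, R)
snr_harm = snr(w_harm, R)
snr_mean = snr(w_mean, R)

print(f"top eigenvector: {snr_opt:.4f}")
print(f"harmonic:        {snr_harm:.4f}")
print(f"mean pooling:    {snr_mean:.4f}")
```

In this toy setting the early positions carry most of the signal, so the decreasing harmonic weights score above mean pooling, and the top eigenvector attains the largest eigenvalue of `R`, which is the maximum possible value of the quotient. With a different `R`, for instance one where late tokens carry more signal, the ordering between harmonic weights and mean pooling can reverse.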

Problem

Research questions and friction points this paper is trying to address.

attention mechanism

signal recovery

random matrices

sequence models

high-dimensional statistics

Innovation

Methods, ideas, or system contributions that make the work stand out.

attention mechanism

random matrix theory

signal recovery