🤖 AI Summary
This work addresses the lack of a theoretical characterization of the singular value spectrum of attention matrices in self-attention mechanisms. Working within random matrix theory, it establishes, for the first time, a Gaussian equivalence result for such matrices. The study shows that, in the natural regime where the inverse temperature stays of constant order, the distribution of squared singular values deviates markedly from the Marchenko–Pastur law. By precisely controlling fluctuations in the softmax normalization term and introducing a refined linearization based on the Taylor expansion of the exponential, the authors rigorously prove that the asymptotic spectrum is captured by an analytically tractable linear model. They further identify the threshold governing the validity of this linearization. Together, these results give a rigorous, computationally accessible spectral theory for self-attention, moving beyond prior empirical analyses and offering new insight into its statistical behavior.
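For concreteness, here is one standard reading of the objects involved; the summary does not spell out the exact model, so the row-wise softmax form, the inverse temperature β, and the Gaussian query/key matrices Q, K ∈ ℝ^{n×d} below are assumptions. The first display gives the attention matrix, the second the Marchenko–Pastur density against which its squared singular values are compared.

```latex
% Assumed setup: row-wise softmax attention at inverse temperature beta,
% and the Marchenko-Pastur density with aspect ratio gamma and scale sigma^2.
\[
  A \;=\; \operatorname{softmax}\!\Bigl(\tfrac{\beta}{\sqrt{d}}\, Q K^{\top}\Bigr)
  \in \mathbb{R}^{n \times n},
  \qquad
  \bigl(\operatorname{softmax}(M)\bigr)_{ij}
  \;=\; \frac{e^{M_{ij}}}{\sum_{k=1}^{n} e^{M_{ik}}} .
\]
\[
  \rho_{\mathrm{MP}}(x)
  \;=\; \frac{\sqrt{(\lambda_{+}-x)(x-\lambda_{-})}}{2\pi\,\sigma^{2}\gamma\, x}\,
        \mathbf{1}_{[\lambda_{-},\lambda_{+}]}(x),
  \qquad
  \lambda_{\pm} \;=\; \sigma^{2}\bigl(1 \pm \sqrt{\gamma}\bigr)^{2}.
\]
```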
📝 Abstract
Self-attention layers have become fundamental building blocks of modern deep neural networks, yet their theoretical understanding remains limited, particularly from the perspective of random matrix theory. In this work, we provide a rigorous analysis of the singular value spectrum of the attention matrix and establish the first Gaussian equivalence result for attention. In a natural regime where the inverse temperature remains of constant order, we show that the singular value distribution of the attention matrix is asymptotically characterized by a tractable linear model. We further demonstrate that the distribution of squared singular values deviates from the Marchenko–Pastur law, contrary to what has been assumed in previous work. Our proof relies on two key ingredients: precise control of the fluctuations in the softmax normalization term and a refined linearization that exploits favorable Taylor expansions of the exponential. The analysis also identifies a threshold for the validity of this linearization and explains why attention, despite not acting entrywise, admits a rigorous Gaussian equivalence in this regime.
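As a rough illustration of the kind of comparison the abstract describes, the sketch below samples a softmax attention matrix with Gaussian queries and keys, computes its squared singular values, and places their bulk next to a variance-matched Marchenko–Pastur curve. The Gaussian query/key model, the 1/√d scaling, the √n rescaling of the spectrum, and the variance matching are illustrative assumptions, not the paper's construction.

```python
# Minimal numerical sketch (illustrative assumptions, not the paper's construction):
# sample a row-wise softmax attention matrix with Gaussian queries/keys, compute its
# squared singular values, and compare their bulk with a variance-matched
# Marchenko-Pastur density.
import numpy as np


def softmax_rows(M):
    M = M - M.max(axis=1, keepdims=True)          # subtract row max for stability
    E = np.exp(M)
    return E / E.sum(axis=1, keepdims=True)


def attention_matrix(n, d, beta=1.0, seed=None):
    rng = np.random.default_rng(seed)
    Q = rng.standard_normal((n, d))               # Gaussian queries (assumption)
    K = rng.standard_normal((n, d))               # Gaussian keys (assumption)
    return softmax_rows(beta * (Q @ K.T) / np.sqrt(d))


def marchenko_pastur_pdf(x, gamma, sigma2=1.0):
    # MP density with aspect ratio gamma <= 1 and scale sigma2.
    lam_minus = sigma2 * (1.0 - np.sqrt(gamma)) ** 2
    lam_plus = sigma2 * (1.0 + np.sqrt(gamma)) ** 2
    pdf = np.zeros_like(x, dtype=float)
    inside = (x > lam_minus) & (x < lam_plus)
    pdf[inside] = np.sqrt((lam_plus - x[inside]) * (x[inside] - lam_minus)) / (
        2.0 * np.pi * sigma2 * gamma * x[inside]
    )
    return pdf


if __name__ == "__main__":
    n, d = 1000, 1000                             # square regime, aspect ratio 1 (assumption)
    A = attention_matrix(n, d, beta=1.0, seed=0)

    # Squared singular values of sqrt(n) * A; the sqrt(n) rescaling keeps the bulk of
    # order one. The largest value is an outlier tied to row-stochasticity (A maps the
    # all-ones vector to itself), so it is dropped from the bulk.
    sq_sv = np.sort(np.linalg.svd(np.sqrt(n) * A, compute_uv=False) ** 2)
    bulk = sq_sv[:-1]

    # Empirical histogram of the bulk versus an MP curve whose scale is matched to the
    # empirical mean -- the comparison that, per the abstract, exhibits a deviation.
    hist, edges = np.histogram(bulk, bins=60, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    mp = marchenko_pastur_pdf(centers, gamma=n / d, sigma2=bulk.mean())
    for c, h, m in zip(centers[::10], hist[::10], mp[::10]):
        print(f"x = {c:7.3f}   empirical = {h:6.3f}   MP = {m:6.3f}")
```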