🤖 AI Summary
This work addresses the lack of a theoretical characterization of the singular value spectrum of attention matrices in self-attention mechanisms. Working within random matrix theory, it establishes, for the first time, a Gaussian equivalence result for such matrices. The study shows that, in the natural regime where the inverse temperature stays of constant order, the distribution of squared singular values deviates markedly from the Marchenko–Pastur law. By precisely controlling fluctuations in the softmax normalization term and introducing a refined linearization based on the Taylor expansion of the exponential, the authors rigorously prove that the asymptotic spectrum is captured by an analytically tractable linear model. They further identify the threshold governing the validity of this linearization. Together, these results give a rigorous, computationally accessible spectral theory for self-attention, moving beyond prior empirical analyses and offering new insight into its statistical behavior.
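For concreteness, here is one standard reading of the objects involved; the summary does not spell out the exact model, so the row-wise softmax form, the inverse temperature β, and the Gaussian query/key matrices Q, K ∈ ℝ^{n×d} below are assumptions. The first display gives the attention matrix, the second the Marchenko–Pastur density against which its squared singular values are compared.

```latex
% Assumed setup: row-wise softmax attention at inverse temperature beta,
% and the Marchenko-Pastur density with aspect ratio gamma and scale sigma^2.
\[
  A \;=\; \operatorname{softmax}\!\Bigl(\tfrac{\beta}{\sqrt{d}}\, Q K^{\top}\Bigr)
  \in \mathbb{R}^{n \times n},
  \qquad
  \bigl(\operatorname{softmax}(M)\bigr)_{ij}
  \;=\; \frac{e^{M_{ij}}}{\sum_{k=1}^{n} e^{M_{ik}}} .
\]
\[
  \rho_{\mathrm{MP}}(x)
  \;=\; \frac{\sqrt{(\lambda_{+}-x)(x-\lambda_{-})}}{2\pi\,\sigma^{2}\gamma\, x}\,
        \mathbf{1}_{[\lambda_{-},\lambda_{+}]}(x),
  \qquad
  \lambda_{\pm} \;=\; \sigma^{2}\bigl(1 \pm \sqrt{\gamma}\bigr)^{2}.
\]
```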
📝 Abstract
Self-attention layers have become fundamental building blocks of modern deep neural networks, yet their theoretical understanding remains limited, particularly from the perspective of random matrix theory. In this work, we provide a rigorous analysis of the singular value spectrum of the attention matrix and establish the first Gaussian equivalence result for attention. In a natural regime where the inverse temperature remains of constant order, we show that the singular value distribution of the attention matrix is asymptotically characterized by a tractable linear model. We further demonstrate that the distribution of squared singular values deviates from the Marchenko–Pastur law, contrary to what has been assumed in previous work. Our proof relies on two key ingredients: precise control of the fluctuations in the softmax normalization term and a refined linearization that exploits favorable Taylor expansions of the exponential. The analysis also identifies a threshold for the validity of this linearization and explains why attention, despite not acting entrywise, admits a rigorous Gaussian equivalence in this regime.
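As a rough illustration of the kind of comparison the abstract describes, the sketch below samples a softmax attention matrix with Gaussian queries and keys, computes its squared singular values, and places their bulk next to a variance-matched Marchenko–Pastur curve. The Gaussian query/key model, the 1/√d scaling, the √n rescaling of the spectrum, and the variance matching are illustrative assumptions, not the paper's construction.

```python
# Minimal numerical sketch (illustrative assumptions, not the paper's construction):
# sample a row-wise softmax attention matrix with Gaussian queries/keys, compute its
# squared singular values, and compare their bulk with a variance-matched
# Marchenko-Pastur density.
import numpy as np


def softmax_rows(M):
    M = M - M.max(axis=1, keepdims=True)          # subtract row max for stability
    E = np.exp(M)
    return E / E.sum(axis=1, keepdims=True)


def attention_matrix(n, d, beta=1.0, seed=None):
    rng = np.random.default_rng(seed)
    Q = rng.standard_normal((n, d))               # Gaussian queries (assumption)
    K = rng.standard_normal((n, d))               # Gaussian keys (assumption)
    return softmax_rows(beta * (Q @ K.T) / np.sqrt(d))


def marchenko_pastur_pdf(x, gamma, sigma2=1.0):
    # MP density with aspect ratio gamma <= 1 and scale sigma2.
    lam_minus = sigma2 * (1.0 - np.sqrt(gamma)) ** 2
    lam_plus = sigma2 * (1.0 + np.sqrt(gamma)) ** 2
    pdf = np.zeros_like(x, dtype=float)
    inside = (x > lam_minus) & (x < lam_plus)
    pdf[inside] = np.sqrt((lam_plus - x[inside]) * (x[inside] - lam_minus)) / (
        2.0 * np.pi * sigma2 * gamma * x[inside]
    )
    return pdf


if __name__ == "__main__":
    n, d = 1000, 1000                             # square regime, aspect ratio 1 (assumption)
    A = attention_matrix(n, d, beta=1.0, seed=0)

    # Squared singular values of sqrt(n) * A; the sqrt(n) rescaling keeps the bulk of
    # order one. The largest value is an outlier tied to row-stochasticity (A maps the
    # all-ones vector to itself), so it is dropped from the bulk.
    sq_sv = np.sort(np.linalg.svd(np.sqrt(n) * A, compute_uv=False) ** 2)
    bulk = sq_sv[:-1]

    # Empirical histogram of the bulk versus an MP curve whose scale is matched to the
    # empirical mean -- the comparison that, per the abstract, exhibits a deviation.
    hist, edges = np.histogram(bulk, bins=60, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    mp = marchenko_pastur_pdf(centers, gamma=n / d, sigma2=bulk.mean())
    for c, h, m in zip(centers[::10], hist[::10], mp[::10]):
        print(f"x = {c:7.3f}   empirical = {h:6.3f}   MP = {m:6.3f}")
```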