Infinite Self-Attention

📅 2026-02-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the quadratic computational complexity of softmax-based attention in Transformers for high-resolution vision tasks by reinterpreting self-attention as a diffusion process on a content-adaptive graph. It establishes, for the first time, a theoretical connection between self-attention and graph centrality measures such as PageRank and Katz centrality, and shows that the discounted Neumann series used to accumulate multi-hop interactions equals the fundamental matrix of an absorbing Markov chain. The proposed Linear-InfSA achieves linear time complexity by approximating the dominant eigenvector of the implicit attention operator with a fixed-size auxiliary state, eliminating the need to explicitly construct the attention matrix. Evaluated on ImageNet-1K, the method attains 84.7% top-1 accuracy (+3.2 points over an equal-depth softmax ViT), supports inference at resolutions up to 9216×9216, delivers a 13× throughput gain, reduces energy consumption to 0.87 J per image, and avoids out-of-memory errors.
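The multi-hop accumulation via a discounted Neumann series can be illustrated with a toy sketch. This is a minimal NumPy illustration of Katz-style centrality on a row-stochastic matrix standing in for one attention head, not the paper's implementation; the matrix `A`, discount `beta`, and iteration count are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 6  # toy number of tokens

# Toy row-stochastic "attention" matrix standing in for one head's scores.
A = rng.random((N, N))
A /= A.sum(axis=1, keepdims=True)

beta = 0.5  # discount; the series converges since beta * spectral_radius(A) < 1

# Katz-style centrality: c = sum_{k>=1} beta^k (A^T)^k 1, accumulated one
# hop at a time, so no matrix inverse or explicit matrix power is formed.
ones = np.ones(N)
term = ones.copy()
c = np.zeros(N)
for _ in range(100):
    term = beta * A.T @ term
    c += term

# Closed form via the Neumann-series / fundamental-matrix identity:
# sum_{k>=1} beta^k M^k = (I - beta M)^{-1} - I, applied to the ones vector.
c_closed = (np.linalg.inv(np.eye(N) - beta * A.T) - np.eye(N)) @ ones
assert np.allclose(c, c_closed)
```

The iterative loop is the point: each pass adds one more hop of discounted diffusion, and the running sum converges to the fundamental-matrix form without ever materializing a matrix inverse.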

📝 Abstract
The quadratic cost of softmax attention limits Transformer scalability in high-resolution vision. We introduce Infinite Self-Attention (InfSA), a spectral reformulation that treats each attention layer as a diffusion step on a content-adaptive token graph, accumulating multi-hop interactions through a discounted Neumann series over attention matrices. This links self-attention to classical graph centrality (Katz, PageRank, eigenvector centrality) for interpretable token weighting. We also show the Neumann kernel equals the fundamental matrix of an absorbing Markov chain, so a token's centrality is its expected number of random-walk visits before absorption. We then propose Linear-InfSA, a linear-time variant that approximates the principal eigenvector of the implicit attention operator without forming the full attention matrix. It keeps an auxiliary state of fixed size proportional to per-head dimension d_h (independent of sequence length N), is drop-in compatible with Vision Transformers, and supports stable training at 4096×4096 and inference at 9216×9216 (about 332k tokens). In a 4-layer ViT (53.5M parameters, 59 GFLOPs at 224×224), Linear-InfSA reaches 84.7% top-1 on ImageNet-1K, a +3.2 point architectural gain over an equal-depth softmax ViT trained with the same recipe. On ImageNet-V2, InfViT variants outperform all compared baselines (up to 79.8% vs 76.8%), indicating robustness under distribution shift. On an A100 40GB GPU, Linear-InfViT runs at 231 images/s and 0.87 J/image (13× better throughput and energy than equal-depth ViT) and is the only tested model to complete 9216×9216 inference without out-of-memory. The linear approximation closely matches the dominant eigenvector of the quadratic operator (cosine similarity 0.985).
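The abstract's key efficiency claim, an auxiliary state whose size depends on d_h but not on N, is the same property that generic kernelized linear attention provides. The sketch below is that generic construction under an assumed elu(x)+1 feature map, not the paper's Linear-InfSA recurrence, which is not specified here:

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Generic kernelized linear attention in O(N * d_h^2) time.

    The accumulated state S has shape (d_h, d_h) and the normalizer z has
    shape (d_h,), both independent of sequence length N -- the fixed-size
    auxiliary-state property the abstract describes.
    """
    phi = lambda X: np.where(X > 0, X + 1.0, np.exp(X))  # elu(x) + 1, always > 0
    Qf, Kf = phi(Q), phi(K)
    S = Kf.T @ V          # (d_h, d_h) key-value summary state
    z = Kf.sum(axis=0)    # (d_h,) normalizer state
    return (Qf @ S) / (Qf @ z + eps)[:, None]

rng = np.random.default_rng(0)
N, dh = 1024, 64
Q, K, V = rng.standard_normal((3, N, dh))
out = linear_attention(Q, K, V)
assert out.shape == (N, dh)
```

Because `S` and `z` are formed once and reused for every query, cost grows linearly in N, which is what makes 332k-token inference feasible where an explicit N×N attention matrix would not fit in memory.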
Problem

Research questions and friction points this paper is trying to address.

self-attention
quadratic complexity
scalability
vision transformers
high-resolution vision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Infinite Self-Attention
Linear-time Attention
Neumann Series
Graph Centrality
Vision Transformers