🤖 AI Summary
The quadratic computational complexity, $O(N^2)$, of self-attention in Vision Transformers severely hinders efficient high-resolution image processing. This work is the first to observe that the attention matrix naturally approximates a block circulant with circulant blocks (BCCB) structure. Leveraging this insight, we propose a learnable circulant attention mechanism, enabling $O(N\log N)$ computation via the Fast Fourier Transform (FFT). Unlike prior sparse or heuristic attention designs, our method requires no hand-crafted sparsity patterns while preserving the full modeling capacity of standard self-attention. Theoretically, we unify structured matrix analysis with attention reconstruction to formalize and exploit this inherent structure. Experiments across image classification, object detection, and semantic segmentation demonstrate performance on par with baseline Transformers, alongside substantial inference speedup. Our implementation is publicly available.
📝 Abstract
The self-attention mechanism has been a key factor in the advancement of vision Transformers. However, its quadratic complexity imposes a heavy computational burden in high-resolution scenarios, restricting its practical application. Previous methods attempt to mitigate this issue by introducing handcrafted patterns such as locality or sparsity, which inevitably compromise model capacity. In this paper, we present a novel attention paradigm termed \textbf{Circulant Attention} that exploits the inherent efficient pattern of self-attention. Specifically, we first identify that the self-attention matrix in vision Transformers often approximates a Block Circulant matrix with Circulant Blocks (BCCB), a kind of structured matrix whose multiplication with other matrices can be performed in $\mathcal{O}(N\log N)$ time. Leveraging this pattern, we explicitly model the attention map as its nearest BCCB matrix and propose an efficient algorithm for fast computation. The resulting approach closely mirrors vanilla self-attention, differing only in its use of BCCB matrices. Since our design builds on this inherent efficient structure, it not only delivers $\mathcal{O}(N\log N)$ computational complexity, but also largely maintains the capacity of standard self-attention. Extensive experiments on diverse visual tasks demonstrate the effectiveness of our approach, establishing circulant attention as a promising alternative to self-attention in vision Transformer architectures. Code is available at https://github.com/LeapLabTHU/Circulant-Attention.
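The key property the abstract relies on is that multiplying by a circulant matrix costs only $\mathcal{O}(N\log N)$, because circulant matrices are diagonalized by the discrete Fourier transform. The following minimal NumPy sketch (not the authors' implementation; function names are illustrative) demonstrates this for a plain circulant matrix, the building block of the BCCB structure:

```python
import numpy as np

def circulant_matvec_fft(c, x):
    """Compute C @ x in O(N log N), where C is the circulant matrix
    whose first column is c. C @ x equals the circular convolution of
    c and x, which becomes an elementwise product in the Fourier domain."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

def circulant_dense(c):
    """Reference O(N^2) construction: column j of C is c rolled by j,
    i.e. C[i, j] = c[(i - j) mod N]."""
    n = len(c)
    return np.stack([np.roll(c, j) for j in range(n)], axis=1)

rng = np.random.default_rng(0)
c = rng.standard_normal(16)   # defines the circulant matrix
x = rng.standard_normal(16)   # input vector

# The FFT-based product matches the dense matrix-vector product.
assert np.allclose(circulant_dense(c) @ x, circulant_matvec_fft(c, x))
```

For a BCCB matrix the same idea applies with a 2D FFT over the block and within-block indices, which is what yields the $\mathcal{O}(N\log N)$ attention computation claimed in the paper.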