🤖 AI Summary
The quadratic computational complexity, $O(N^2)$, of self-attention in Vision Transformers severely hinders efficient high-resolution image processing. This work is the first to observe that the attention matrix naturally approximates a block circulant with circulant blocks (BCCB) structure. Leveraging this insight, we propose a learnable circulant attention mechanism, enabling $O(N\log N)$ computation via the Fast Fourier Transform (FFT). Unlike prior sparse or heuristic attention designs, our method requires no hand-crafted sparsity patterns while preserving the full modeling capacity of standard self-attention. Theoretically, we unify structured matrix analysis with attention reconstruction to formalize and exploit this inherent structure. Experiments across image classification, object detection, and semantic segmentation demonstrate performance on par with baseline Transformers, alongside substantial inference speedup. Our implementation is publicly available.
📝 Abstract
The self-attention mechanism has been a key factor in the advancement of vision Transformers. However, its quadratic complexity imposes a heavy computational burden in high-resolution scenarios, restricting its practical application. Previous methods attempt to mitigate this issue by introducing handcrafted patterns such as locality or sparsity, which inevitably compromise model capacity. In this paper, we present a novel attention paradigm termed \textbf{Circulant Attention} that exploits the inherent efficient pattern of self-attention. Specifically, we first identify that the self-attention matrix in vision Transformers often approximates a Block Circulant matrix with Circulant Blocks (BCCB), a kind of structured matrix whose multiplication with other matrices can be performed in $\mathcal{O}(N\log N)$ time. Leveraging this pattern, we explicitly model the attention map as its nearest BCCB matrix and propose an efficient algorithm for fast computation. The resulting approach closely mirrors vanilla self-attention, differing only in its use of BCCB matrices. Since our design builds on this inherent efficient structure, it not only delivers $\mathcal{O}(N\log N)$ computational complexity, but also largely maintains the capacity of standard self-attention. Extensive experiments on diverse visual tasks demonstrate the effectiveness of our approach, establishing circulant attention as a promising alternative to self-attention in vision Transformer architectures. Code is available at https://github.com/LeapLabTHU/Circulant-Attention.
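The key property the abstract relies on is that multiplying by a circulant matrix costs only $\mathcal{O}(N\log N)$, because circulant matrices are diagonalized by the discrete Fourier transform. The following minimal NumPy sketch (not the authors' implementation; function names are illustrative) demonstrates this for a plain circulant matrix, the building block of the BCCB structure:

```python
import numpy as np

def circulant_matvec_fft(c, x):
    """Compute C @ x in O(N log N), where C is the circulant matrix
    whose first column is c. C @ x equals the circular convolution of
    c and x, which becomes an elementwise product in the Fourier domain."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

def circulant_dense(c):
    """Reference O(N^2) construction: column j of C is c rolled by j,
    i.e. C[i, j] = c[(i - j) mod N]."""
    n = len(c)
    return np.stack([np.roll(c, j) for j in range(n)], axis=1)

rng = np.random.default_rng(0)
c = rng.standard_normal(16)   # defines the circulant matrix
x = rng.standard_normal(16)   # input vector

# The FFT-based product matches the dense matrix-vector product.
assert np.allclose(circulant_dense(c) @ x, circulant_matvec_fft(c, x))
```

For a BCCB matrix the same idea applies with a 2D FFT over the block and within-block indices, which is what yields the $\mathcal{O}(N\log N)$ attention computation claimed in the paper.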