AnchorFormer: Differentiable Anchor Attention for Efficient Vision Transformer

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision Transformers (ViTs) suffer from the quadratic computational complexity $O(n^2)$ of global self-attention and carry many tokens that are redundant for downstream tasks. To address this, we propose a differentiable anchor attention mechanism that approximates global token interactions using a small set of learnable anchor tokens. Our approach introduces differentiable anchor representations and models token-anchor interactions via a Markov process, enabling end-to-end joint optimization. We further design a bipartite graph attention module and a lightweight ViT architecture, unifying support for classification, detection, and segmentation. On ImageNet, our method achieves up to 9.0% higher top-1 accuracy or reduces FLOPs by 46.7%; on COCO detection, it achieves 81.3% higher mAP at comparable FLOPs. The core innovation lies in reformulating global attention approximation as a differentiable anchor learning problem, striking a favorable balance between efficiency and generalization across vision tasks.

📝 Abstract
Recently, vision transformers (ViTs) have achieved excellent performance on vision tasks by measuring the global self-attention among the image patches. Given $n$ patches, global self-attention has quadratic complexity $\mathcal{O}(n^2)$, so the time cost is high when the input image is split at a fine granularity. Meanwhile, the pivotal information is often gathered in a few regions of an input image, and some tokens may not be helpful for the downstream tasks. To handle this problem, we introduce an anchor-based efficient vision transformer (AnchorFormer), which employs anchor tokens to learn the pivotal information and accelerate the inference. Firstly, by estimating the bipartite attention between the anchors and the tokens, the complexity is reduced from $\mathcal{O}(n^2)$ to $\mathcal{O}(mn)$, where $m$ is the number of anchors and $m < n$. Notably, by representing the anchors with the neurons of a neural layer, we can differentiably learn these distributions and approximate global self-attention through a Markov process. Moreover, we extend the proposed model to three downstream tasks: classification, detection, and segmentation. Extensive experiments show the effectiveness of our AnchorFormer, e.g., achieving up to 9.0% higher accuracy or a 46.7% FLOPs reduction on ImageNet classification, and 81.3% higher mAP on COCO detection under comparable FLOPs, as compared to the current baselines.
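The bipartite anchor attention described in the abstract can be sketched numerically: tokens attend to a small set of anchors, anchors attend back to tokens, and the product of the two stochastic maps acts as a two-step Markov approximation of full $n \times n$ self-attention at $\mathcal{O}(mn)$ cost. The following is a minimal NumPy illustration, not the paper's implementation; the function name, the use of random weights in place of learned anchor neurons, and the single-head, no-projection setup are all simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def anchor_attention(x, anchors, wv):
    """Approximate global self-attention via anchor tokens (illustrative sketch).

    x:       (n, d) token embeddings
    anchors: (m, d) anchor tokens (learnable in the paper; random here)
    wv:      (d, d) value projection
    Cost is O(m*n*d) instead of the O(n^2*d) of full self-attention.
    """
    d = x.shape[-1]
    v = x @ wv                                        # values, (n, d)
    q = softmax(x @ anchors.T / np.sqrt(d), axis=-1)  # token -> anchor, (n, m), row-stochastic
    p = softmax(anchors @ x.T / np.sqrt(d), axis=-1)  # anchor -> token, (m, n), row-stochastic
    # Two-step transition q @ p implicitly defines an n x n row-stochastic
    # attention matrix without ever materializing it.
    return q @ (p @ v)

rng = np.random.default_rng(0)
n, m, d = 196, 16, 64                 # e.g. 14x14 patches, 16 anchors
x = rng.standard_normal((n, d))
anchors = rng.standard_normal((m, d))  # stands in for learned anchor neurons
wv = rng.standard_normal((d, d))
out = anchor_attention(x, anchors, wv)  # (196, 64)
```

Since both factors are row-stochastic, their product is too, so the implicit attention remains a valid probability mixture over tokens, which is what the Markov-process view in the abstract refers to.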
Problem

Research questions and friction points this paper is trying to address.

Reduces quadratic complexity in vision transformers
Focuses on pivotal image regions efficiently
Improves performance in classification, detection, segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Anchor tokens reduce quadratic complexity to linear
Differentiable anchor attention via neural layer neurons
Extends to classification, detection, and segmentation tasks
Jiquan Shan
PetroChina Changqing Oilfield Company, Xi’an, Shaanxi, China
Junxiao Wang
KAUST, Postdoctoral Fellow
Generative AI, Distributed Machine Learning, AI Security and Privacy
Lifeng Zhao
PetroChina Changqing Oilfield Company, Xi’an, Shaanxi, China
Liang Cai
PetroChina Changqing Oilfield Company, Xi’an, Shaanxi, China
Hongyuan Zhang
The University of Hong Kong
Ioannis Liritzis
Alma Mater Europaea University, South China University of Technology, Guangzhou, Guangdong, China