S2AFormer: Strip Self-Attention for Efficient Vision Transformer

📅 2025-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision Transformers (ViTs) suffer from quadratic computational complexity in global self-attention, hindering efficiency. To address this, we propose Strip Self-Attention (SSA), which reduces attention cost via structured dimensionality reduction: compressing spatial dimensions of keys/values and channel dimensions of queries/keys. We further design Hybrid Perception Blocks (HPBs) that seamlessly integrate CNNs’ local inductive bias with Transformers’ global modeling capacity. Jointly, SSA and HPBs form a lightweight, efficient ViT architecture, enhanced by optimized matrix operations. Evaluated on ImageNet-1k, ADE20k, and COCO, our method achieves significant inference speedup while maintaining or improving accuracy—demonstrating strong efficiency and generalization across both GPU and non-GPU platforms. Our core contributions are twofold: (i) the first introduction of structured dimensionality reduction into self-attention mechanisms, and (ii) establishing a novel local–global collaborative perception paradigm for vision modeling.

📝 Abstract
Vision Transformer (ViT) has made significant advancements in computer vision, thanks to its token mixer's sophisticated ability to capture global dependencies between all tokens. However, the quadratic growth in computational demands as the number of tokens increases limits its practical efficiency. Although recent methods have combined the strengths of convolutions and self-attention to achieve better trade-offs, the expensive pairwise token affinity and complex matrix operations inherent in self-attention remain a bottleneck. To address this challenge, we propose S2AFormer, an efficient Vision Transformer architecture featuring novel Strip Self-Attention (SSA). We design simple yet effective Hybrid Perception Blocks (HPBs) to integrate the local perception capabilities of CNNs with the global context modeling of Transformer's attention mechanisms. A key innovation of SSA lies in reducing the spatial dimensions of $K$ and $V$ while compressing the channel dimensions of $Q$ and $K$. This design significantly reduces computational overhead while preserving accuracy, striking an optimal balance between efficiency and effectiveness. We evaluate the robustness and efficiency of S2AFormer through extensive experiments on multiple vision benchmarks, including ImageNet-1k for image classification, ADE20k for semantic segmentation, and COCO for object detection and instance segmentation. Results demonstrate that S2AFormer achieves significant accuracy gains with superior efficiency in both GPU and non-GPU environments, making it a strong candidate for efficient vision Transformers.
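The abstract's core idea, shrinking the $N \times N$ token affinity by reducing the spatial dimension of $K$ and $V$ and the channel dimension of $Q$ and $K$, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the projection matrices are random placeholders, and average pooling over token "strips" stands in for whatever spatial-reduction operator S2AFormer actually uses; the reduction ratios `r_s` and `r_c` are assumed hyperparameters.

```python
import numpy as np

def strip_self_attention(x, r_s=4, r_c=2, seed=0):
    """Illustrative sketch of Strip Self-Attention (not the paper's exact layers).

    x: (N, C) token matrix. K/V keep only N // r_s pooled tokens and
    Q/K keep only C // r_c channels, so the affinity matrix shrinks
    from (N, N) to (N, N // r_s).
    """
    rng = np.random.default_rng(seed)
    N, C = x.shape
    Cq = C // r_c                # reduced channel dim shared by Q and K
    Nk = N // r_s                # reduced token count for K and V

    # Random linear projections as placeholders for learned weights.
    Wq = rng.standard_normal((C, Cq)) / np.sqrt(C)
    Wk = rng.standard_normal((C, Cq)) / np.sqrt(C)
    Wv = rng.standard_normal((C, C)) / np.sqrt(C)

    q = x @ Wq                                    # (N, Cq)
    # Spatial reduction: average-pool tokens in strips of length r_s.
    x_red = x.reshape(Nk, r_s, C).mean(axis=1)    # (Nk, C)
    k = x_red @ Wk                                # (Nk, Cq)
    v = x_red @ Wv                                # (Nk, C)

    # Affinity is (N, Nk) rather than (N, N): cost drops by r_s.
    attn = q @ k.T / np.sqrt(Cq)
    attn = np.exp(attn - attn.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)      # row-wise softmax
    return attn @ v                               # (N, C), full channel width restored

out = strip_self_attention(np.ones((16, 8)) * 0.1)
```

With `N = 16` tokens and `r_s = 4`, the affinity matrix has 16 × 4 entries instead of 16 × 16, while the output keeps the original `(N, C)` shape, which is the efficiency/effectiveness trade-off the abstract describes.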
Problem

Research questions and friction points this paper is trying to address.

Reduces computational overhead in Vision Transformers
Integrates CNN local perception with Transformer global context
Improves efficiency while preserving accuracy in vision tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Strip Self-Attention reduces computational overhead
Hybrid Perception Blocks integrate CNNs and Transformers
Compress spatial and channel dimensions for efficiency
Guoan Xu
Faculty of Engineering and Information Technology, University of Technology Sydney
Wenfeng Huang
Faculty of Engineering and Information Technology, University of Technology Sydney
Wenjing Jia
University of Technology Sydney
Image analysis, computer vision, pattern recognition
Jiamao Li
Bionic Vision System Laboratory, State Key Laboratory of Transducer Technology, Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences
Guangwei Gao
Professor of PCALab@NJUST, IEEE/CCF/CSIG/CAAI/CAA Senior Member
Pattern Recognition, Image Understanding, Machine Learning
Guojun Qi
Research Center for Industries of the Future and the School of Engineering, Westlake University, and OPPO Research