Lightweight Structure-Aware Attention for Visual Understanding

📅 2022-11-29
🏛️ International Journal of Computer Vision
📈 Citations: 2
Influential: 0
🤖 AI Summary
Vision Transformers face two key bottlenecks in self-attention: weak discriminability of the attention kernel and quadratic O(N²) cost in the sequence length. This paper proposes Lightweight Structure-Aware Attention (LiSA), which integrates structural priors, modeling local image structure through a set of relative position embeddings (RPEs), with a frequency-domain approximation that reaches log-linear O(N log N) complexity via the fast Fourier transform. LiSA reduces layer redundancy while preserving strong representational capacity, thereby breaking the conventional quadratic complexity barrier. On ImageNet, LiSA-based ViTs achieve state-of-the-art top-1 accuracy, and they remain competitive on COCO object detection and Something-Something-V2 action recognition, confirming the operator's generality and efficiency across diverse vision tasks.
📝 Abstract
Vision Transformers (ViTs) have become a dominant paradigm for visual representation learning with self-attention operators. Although these operators provide flexibility to the model with their adjustable attention kernels, they suffer from inherent limitations: (1) the attention kernel is not discriminative enough, resulting in high redundancy of the ViT layers, and (2) the complexity in computation and memory is quadratic in the sequence length. In this paper, we propose a novel attention operator, called lightweight structure-aware attention (LiSA), which has a better representation power with log-linear complexity. Our operator learns structural patterns by using a set of relative position embeddings (RPEs). To achieve log-linear complexity, the RPEs are approximated with fast Fourier transforms. Our experiments and ablation studies demonstrate that ViTs based on the proposed operator outperform self-attention and other existing operators, achieving state-of-the-art results on ImageNet, and competitive results on other visual understanding benchmarks such as COCO and Something-Something-V2. The source code of our approach will be released online.
Problem

Research questions and friction points this paper is trying to address.

Attention kernels are not discriminative enough, leaving ViT layers highly redundant
Self-attention's computation and memory cost grows quadratically with sequence length
Standard attention under-exploits the structural patterns present in visual data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight Structure-aware Attention (LiSA) operator
Relative position embeddings (RPEs) learned as structural attention weights
Log-linear O(N log N) complexity via FFT-based approximation of the RPEs
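The complexity claim above rests on a classical identity: when an attention weight depends only on the relative position (i − j) of two tokens, the weighted aggregation over all tokens is a circular convolution, which the FFT evaluates in O(N log N) rather than O(N²). The sketch below illustrates that identity with a toy 1-D sequence; `fft_rpe_mix` and its shapes are illustrative assumptions, not the paper's actual operator, which learns 2-D structural RPE patterns per head.

```python
import numpy as np

def fft_rpe_mix(x, rpe):
    """Aggregate token features with relative-position weights via FFT.

    If the weight between tokens i and j depends only on (i - j) mod N,
    then  y[i] = sum_j rpe[(i - j) mod N] * x[j]  is a circular
    convolution, computable in O(N log N) by the convolution theorem.

    x:   (N, d) token features
    rpe: (N,)   one learned weight per circular relative offset
    """
    k = np.fft.fft(rpe)                      # spectrum of the RPE kernel
    X = np.fft.fft(x, axis=0)                # per-channel token spectra
    y = np.fft.ifft(X * k[:, None], axis=0)  # pointwise product = convolution
    return y.real

# Sanity check against the naive O(N^2) double loop
N, d = 8, 4
rng = np.random.default_rng(0)
x = rng.standard_normal((N, d))
rpe = rng.standard_normal(N)
naive = np.zeros((N, d))
for i in range(N):
    for j in range(N):
        naive[i] += rpe[(i - j) % N] * x[j]
assert np.allclose(fft_rpe_mix(x, rpe), naive)
```

The design point this illustrates: replacing content-dependent pairwise scores with position-dependent ones trades some flexibility for a structured, convolution-like operator whose cost no longer scales quadratically.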
Heeseung Kwon
Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK.
Francisco M. Castro
Department of Computer Architecture, University of Málaga
Manuel J. Marin-Jimenez
Department of Computing and Numerical Analysis, University of Córdoba
Nicolas Guil
University of Málaga
Computer Architecture, Computer Vision
Karteek Alahari
Inria
Computer Vision, Machine Learning