Vision Transformer with Sparse Scan Prior

📅 2024-05-22

🏛️ arXiv.org

📈 Citations: 6

✨ Influential: 0

🤖 AI Summary

To address the high computational cost of global self-attention in Vision Transformers (ViTs), this paper draws on the human eye's sparse scanning mechanism and proposes Sparse Scan Self-Attention (S³A). S³A predefines a set of Anchors of Interest for each token and applies local attention around those anchors, which avoids both redundant global modeling and excessive focus on local information. Built on S³A, the Sparse Scan Vision Transformer (SSViT) reaches 84.4% and 85.7% top-1 accuracy on ImageNet at 4.4G and 18.2G FLOPs, respectively, without additional supervision or training data. The abstract also reports strong results on object detection, instance segmentation, and semantic segmentation, and robustness across diverse datasets.

📝 Abstract

In recent years, Transformers have achieved remarkable progress in computer vision tasks. However, their global modeling often comes with substantial computational overhead, in stark contrast to the human eye's efficient information processing. Inspired by the human eye's sparse scanning mechanism, we propose a Sparse Scan Self-Attention mechanism (S³A). This mechanism predefines a series of Anchors of Interest for each token and employs local attention to efficiently model the spatial information around these anchors, avoiding redundant global modeling and excessive focus on local information. This approach mirrors the human eye's functionality and significantly reduces the computational load of vision models. Building on S³A, we introduce the Sparse Scan Vision Transformer (SSViT). Extensive experiments demonstrate the outstanding performance of SSViT across a variety of tasks. Specifically, on ImageNet classification, without additional supervision or training data, SSViT achieves top-1 accuracies of 84.4%/85.7% with 4.4G/18.2G FLOPs. SSViT also excels in downstream tasks such as object detection, instance segmentation, and semantic segmentation. Its robustness is further validated across diverse datasets. Code will be available at https://github.com/qhfan/SSViT.
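
The anchor-based attention described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `sparse_scan_attention`, the fixed `anchor_offsets` layout, the clamping at image borders, and the identity query/key/value projections are all assumptions made for illustration. Each token attends only to small windows centered on its own anchors rather than to every other token.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def sparse_scan_attention(x, anchor_offsets, window=3):
    """Illustrative anchor-based local attention (hypothetical sketch).

    x: (H, W, C) feature map.
    anchor_offsets: list of (dy, dx) offsets placing each token's anchors
        relative to the token itself (an assumed, fixed layout).
    window: side length of the local window around each anchor.
    """
    H, W, C = x.shape
    r = window // 2
    q = k = v = x  # identity projections for brevity (assumption)
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            ys, xs = [], []
            for dy, dx in anchor_offsets:
                # Clamp the anchor so it stays inside the feature map.
                ay = min(max(i + dy, 0), H - 1)
                ax = min(max(j + dx, 0), W - 1)
                # Collect positions in the window around the anchor.
                # Overlapping windows may contribute a position twice.
                for yy in range(max(ay - r, 0), min(ay + r + 1, H)):
                    for xx in range(max(ax - r, 0), min(ax + r + 1, W)):
                        ys.append(yy)
                        xs.append(xx)
            keys = k[ys, xs]   # (n, C)
            vals = v[ys, xs]   # (n, C)
            weights = softmax(keys @ q[i, j] / np.sqrt(C))
            out[i, j] = weights @ vals
    return out

if __name__ == "__main__":
    feats = np.random.default_rng(0).normal(size=(8, 8, 16))
    offsets = [(0, 0), (-3, 0), (3, 0), (0, -3), (0, 3)]
    print(sparse_scan_attention(feats, offsets).shape)
```

The explicit Python loops keep the sketch readable. A practical version would batch the window gathering, and the actual anchor placement in SSViT may differ from the cross-shaped layout assumed here.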

Problem

Research questions and friction points this paper is trying to address.

Reducing computational overhead in Vision Transformers' global modeling

Mimicking human eye's sparse scanning for efficient information processing

Avoiding redundant global modeling while preserving spatial information

Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Scan Self-Attention mechanism reduces computation

Local attention around predefined Anchors of Interest

SSViT reaches 84.4%/85.7% ImageNet top-1 accuracy at 4.4G/18.2G FLOPs
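
To see why restricting attention to windows around a few anchors reduces computation, the following back-of-the-envelope sketch counts query-key pairs. The grid size, anchor count, and window size are hypothetical values chosen for illustration, not figures from the paper, and the sparse count is an upper bound that ignores clipping at image borders.

```python
def global_pairs(h, w):
    # Global self-attention: every token attends to every token.
    n = h * w
    return n * n

def sparse_scan_pairs(h, w, num_anchors, window):
    # Anchor-based local attention: each token attends to at most
    # num_anchors windows of window x window positions.
    return h * w * num_anchors * window * window

if __name__ == "__main__":
    # Hypothetical setting: 56x56 token grid, 9 anchors, 3x3 windows.
    g = global_pairs(56, 56)
    s = sparse_scan_pairs(56, 56, num_anchors=9, window=3)
    print(g, s, round(g / s, 1))
```

Under these assumed values the pair count grows linearly with the number of tokens instead of quadratically, which is the general source of the savings. The FLOPs reported for SSViT depend on the full architecture, not only on this count.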
