SAC-ViT: Semantic-Aware Clustering Vision Transformer with Early Exit

📅 2025-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational overhead of Vision Transformers (ViTs) on resource-constrained devices—stemming from the quadratic complexity of self-attention—the paper proposes an end-to-end trainable, two-stage semantic-aware inference architecture. In Stage I, an early-exit mechanism generates coarse predictions and localizes salient image regions. In Stage II, a novel non-iterative, semantic-aware dynamic token clustering (SAC) method models target regions at fine granularity while reusing non-target tokens to eliminate redundant computation. The approach integrates region cropping, embedding remapping, and local attention to substantially reduce computational load. Evaluated against DeiT baselines, it reduces FLOPs by 62% and improves throughput by 1.98x with no loss of accuracy, offering a practical paradigm for efficient ViT deployment.

📝 Abstract
The Vision Transformer (ViT) excels at global modeling but faces deployment challenges on resource-constrained devices due to the quadratic computational complexity of its attention mechanism. To address this, we propose the Semantic-Aware Clustering Vision Transformer (SAC-ViT), a non-iterative approach to enhancing ViT's computational efficiency. SAC-ViT operates in two stages: Early Exit (EE) and Semantic-Aware Clustering (SAC). In the EE stage, downsampled input images are processed to extract global semantic information and generate initial inference results. If these results do not meet the EE termination criteria, the tokens are clustered into target and non-target groups. In the SAC stage, target tokens are mapped back to the original image, cropped, and re-embedded. These target tokens are then combined with reused non-target tokens from the EE stage, and the attention mechanism is applied within each cluster. This two-stage design, optimized end to end, reduces spatial redundancy and enhances computational efficiency, significantly boosting overall ViT performance. Extensive experiments demonstrate the efficacy of SAC-ViT: it reduces the FLOPs of DeiT by 62% and achieves a 1.98x throughput improvement without compromising performance.
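The two-stage control flow described in the abstract can be sketched in plain Python. This is an illustrative outline only, with hypothetical names (`two_stage_inference`, the 0.5 saliency cutoff, and scalar confidence scores are assumptions for clarity); the real model operates on ViT token embeddings and learned exit criteria.

```python
def two_stage_inference(coarse_probs, saliency, exit_threshold=0.9):
    """Sketch of the EE -> SAC control flow (hypothetical simplification).

    coarse_probs: class probabilities from the Stage I (downsampled) pass.
    saliency: per-token scores used to split target vs. non-target tokens.
    Returns (prediction, stage_label).
    """
    # Stage I (Early Exit): coarse prediction from the downsampled image.
    confidence = max(coarse_probs)
    pred = coarse_probs.index(confidence)
    if confidence >= exit_threshold:
        return pred, "EE"  # termination criterion met; skip Stage II

    # Stage II (SAC): a single, non-iterative split into two clusters.
    target = [i for i, s in enumerate(saliency) if s >= 0.5]
    non_target = [i for i, s in enumerate(saliency) if s < 0.5]
    # In the paper, target tokens are remapped to the original image,
    # cropped, re-embedded, and attended within their cluster, while
    # non-target tokens from Stage I are reused without recomputation.
    # Here we simply return the coarse prediction to keep the sketch runnable.
    assert set(target).isdisjoint(non_target)
    return pred, "SAC"
```

The point of the structure is that attention is never recomputed over the full token set: confident samples exit after the cheap coarse pass, and hard samples pay fine-grained cost only for the target cluster.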
Problem

Research questions and friction points this paper is trying to address.

How to reduce the quadratic computational cost of ViT self-attention for deployment on resource-constrained devices.
How to exploit spatial redundancy, since not every image region warrants fine-grained modeling.
How to avoid the overhead of iterative token-clustering schemes while keeping the pipeline end-to-end trainable.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Early Exit stage terminates easy samples after a cheap coarse pass, cutting computational load.
Non-iterative Semantic-Aware Clustering refines target tokens while reusing non-target tokens from Stage I.
End-to-end optimization of both stages boosts overall ViT performance.
Youbing Hu
Faculty of Computing, Harbin Institute of Technology

Yun Cheng
Princeton University

Anqi Lu
Faculty of Computing, Harbin Institute of Technology

Dawei Wei
School of Cyber Engineering, Xidian University

Zhijun Li
Faculty of Computing, Harbin Institute of Technology