Similarity-Aware Token Pruning: Your VLM but Faster

📅 2025-03-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision Transformers (ViTs) and Vision-Language Models (VLMs) suffer from high inference overhead due to the quadratic computational complexity of self-attention. Method: We propose SAINT, a training-free, dynamic hierarchical token pruning framework. It uncovers a universal three-stage token evolution pattern—aligner, explorer, aggregator—in Transformers, enabling aggressive early-stage pruning. Leveraging token similarity and graph-structured modeling, SAINT introduces a unified, modality-agnostic pruning paradigm that supports ViT-only, LLM-only, and hybrid VLM configurations, with cross-layer adaptive optimization of pruning ratios and redundancy thresholds. Results: On ImageNet-1K, ViT-H/14 achieves 2× throughput gain with only 0.6% top-1 accuracy drop. For LLaVA-13B, SAINT reduces token count by 75%, cutting latency to near that of LLaVA-7B while degrading multi-task performance by less than 1%.

📝 Abstract
The computational demands of Vision Transformers (ViTs) and Vision-Language Models (VLMs) remain a significant challenge due to the quadratic complexity of self-attention. While token pruning offers a promising solution, existing methods often introduce training overhead or fail to adapt dynamically across layers. We present SAINT, a training-free token pruning framework that leverages token similarity and a graph-based formulation to dynamically optimize pruning rates and redundancy thresholds. Through systematic analysis, we identify a universal three-stage token evolution process (aligner-explorer-aggregator) in transformers, enabling aggressive pruning in early stages without sacrificing critical information. For ViTs, SAINT doubles the throughput of ViT-H/14 at 224px with only 0.6% accuracy loss on ImageNet-1K, surpassing the closest competitor by 0.8%. For VLMs, we apply SAINT in three modes: ViT-only, LLM-only, and hybrid. SAINT reduces LLaVA-13B's tokens by 75%, achieving latency comparable to LLaVA-7B with less than 1% performance loss across benchmarks. Our work establishes a unified, practical framework for efficient inference in ViTs and VLMs.
Problem

Research questions and friction points this paper is trying to address.

High computational cost of Vision Transformers and Vision-Language Models from quadratic self-attention
Existing token pruning methods require training overhead or use fixed, layer-agnostic pruning rates
Need for inference speedups that preserve accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free token pruning framework SAINT
Graph-based dynamic pruning rate optimization
Universal three-stage token evolution process
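The paper does not include pseudocode, but the core idea of similarity-driven, training-free pruning can be sketched as a greedy redundancy filter: drop any token that is nearly a duplicate of one already kept. The function name, the greedy kept-set rule, and the fixed threshold below are illustrative assumptions, not SAINT's actual graph-based, cross-layer-adaptive procedure.

```python
import numpy as np

def similarity_prune(tokens: np.ndarray, threshold: float = 0.9) -> np.ndarray:
    """Greedy similarity-based token pruning (illustrative sketch).

    tokens: (N, D) array of token embeddings.
    Keeps a token only if its cosine similarity to every already-kept
    token is below `threshold`; near-duplicates are discarded.
    """
    # Normalize rows so dot products equal cosine similarities.
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    kept = [0]  # always retain the first token (e.g. a [CLS]-like token)
    for i in range(1, len(tokens)):
        sims = normed[kept] @ normed[i]  # similarity to each kept token
        if sims.max() < threshold:       # not redundant w.r.t. kept set
            kept.append(i)
    return tokens[kept]
```

In SAINT the pruning ratio and redundancy threshold are optimized per layer rather than fixed, and pruning is applied aggressively in the early "aligner" stage where the paper finds tokens are most redundant.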