🤖 AI Summary
Large Vision-Language Models (VLMs) suffer from excessive redundant visual tokens during inference, leading to high computational overhead. Existing token pruning methods either rely on unstable attention scores or erroneously remove critical regions because they overemphasize token diversity. To address these limitations, we propose a training-free, zeroth-order gradient-driven token pruning framework. Our approach is the first to leverage zeroth-order gradient estimation for visual token sensitivity analysis: lightweight perturbations applied at the projection layer, followed by forward propagation, efficiently quantify each token's impact on the final output, jointly capturing saliency and information complementarity. This design avoids attention bias and the diversity–relevance mismatch. Extensive experiments across diverse VLM architectures and benchmarks demonstrate its effectiveness: up to 94.4% of visual tokens can be pruned without accuracy degradation, yielding an end-to-end inference speedup of up to 2.30×.
📝 Abstract
Large Vision-Language Models (VLMs) enable strong multimodal reasoning but incur heavy inference costs from redundant visual tokens. Token pruning alleviates this issue, yet existing approaches face limitations. Attention-based methods rely on raw attention scores, which are often unstable across layers and heads and can lead to redundant selections. Diversity-based methods improve robustness by selecting tokens that are far apart in feature space, but they risk dropping regions needed for accurate prediction. We propose our method, a training-free framework built on a simple intuition: tokens with higher sensitivity are more likely to influence the model's output, and the retained tokens should also capture complementary visual cues rather than overlapping information. To achieve this, we estimate token sensitivity using zeroth-order perturbations at the projection layer, a shallow and computationally light component of the model. This approach measures how small random perturbations affect the projection outputs, allowing us to approximate each token's influence through lightweight forward passes without backpropagation. Extensive experiments across multiple VLMs and benchmarks show that our method consistently outperforms prior approaches, pruning up to 94.4% of visual tokens while maintaining accuracy and significantly improving efficiency, achieving up to 2.30× faster end-to-end inference than the baseline.
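To make the sensitivity estimate concrete, below is a minimal PyTorch-style sketch of the zeroth-order perturbation idea described above. The function name, the number of perturbation samples, the perturbation scale, and the `mm_projector` attribute in the usage comment are illustrative assumptions rather than the paper's released implementation, and the sketch covers only the sensitivity-scoring step, not the full complementarity-aware token selection.

```python
import torch

def token_sensitivity_zeroth_order(projector, vision_feats, n_samples=4, eps=1e-3):
    """Illustrative sketch (not the authors' released code).

    Scores each visual token by how strongly small random perturbations,
    injected at the projection layer, change its projected output: a
    symmetric finite-difference (zeroth-order) estimate that needs only
    forward passes, no backpropagation.

    vision_feats: [T, d_vision] visual-encoder outputs for one image.
    projector:    frozen vision-to-language projection module.
    Returns:      [T] sensitivity scores (higher = more influential).
    """
    scores = torch.zeros(vision_feats.size(0), device=vision_feats.device)
    with torch.no_grad():
        for _ in range(n_samples):
            # Random unit direction, one perturbation vector per token.
            u = torch.randn_like(vision_feats)
            u = u / (u.norm(dim=-1, keepdim=True) + 1e-8)

            # Two lightweight forward passes through the projector only.
            out_pos = projector(vision_feats + eps * u)
            out_neg = projector(vision_feats - eps * u)

            # Magnitude of the symmetric finite difference approximates the
            # directional derivative of the projection w.r.t. each token.
            scores += (out_pos - out_neg).norm(dim=-1) / (2.0 * eps)
    return scores / n_samples

# Hypothetical usage: keep only the most sensitive tokens (e.g. prune ~94.4%).
# scores = token_sensitivity_zeroth_order(model.mm_projector, vision_feats)
# keep = scores.topk(max(1, int(0.056 * scores.numel()))).indices.sort().values
```

Because the projection layer is a shallow, computationally light component, the extra forward passes in this loop are cheap relative to a full pass through the language model, which is what keeps the scoring overhead small.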