STAR: Stage-Wise Attention-Guided Token Reduction for Efficient Large Vision-Language Models Inference

📅 2025-05-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large vision-language models (LVLMs) suffer from high inference overhead due to dense visual token representations. Method: This paper proposes a training-free, two-stage token compression framework. In the first stage, redundant low-level visual features are pruned via intra-modal visual self-attention; in the second stage, task-irrelevant tokens are filtered using cross-modal vision-text attention, enabling staged dynamic pruning guided by the model's global information flow. Contribution/Results: The method departs from conventional single-stage, local pruning paradigms and maintains or even improves accuracy under high pruning ratios. It is architecture-agnostic and plug-and-play, achieving up to 2.5× inference speedup on mainstream benchmarks while preserving or exceeding the original model's accuracy.
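The first stage described above can be sketched as follows: rank each visual token by how much attention it receives from the other visual tokens, then keep only the top fraction. This is a minimal illustration, not the paper's implementation; the function name, the mean-over-queries importance score, and the toy attention matrix are all assumptions.

```python
import numpy as np

def prune_by_self_attention(tokens, attn, keep_ratio=0.5):
    """Stage-1 sketch (hypothetical helper): keep the visual tokens
    that receive the most intra-modal self-attention.

    tokens : (N, D) visual token embeddings
    attn   : (N, N) visual self-attention weights (rows sum to 1)
    """
    # Importance of token j = mean attention paid to it by all query tokens.
    importance = attn.mean(axis=0)                 # (N,)
    k = max(1, int(round(keep_ratio * len(tokens))))
    keep = np.sort(np.argsort(importance)[-k:])    # keep original token order
    return tokens[keep], keep

# Toy example: 6 visual tokens of dimension 4, random softmax attention.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 4))
logits = rng.normal(size=(6, 6))
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
pruned, kept = prune_by_self_attention(tokens, attn, keep_ratio=0.5)
print(pruned.shape)  # half the tokens survive
```

In practice the attention map would come from an early LVLM layer rather than random logits; the sketch only shows the ranking-and-selection logic.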

📝 Abstract
Although large vision-language models (LVLMs) leverage rich visual token representations to achieve strong performance on multimodal tasks, these tokens also introduce significant computational overhead during inference. Existing training-free token pruning methods typically adopt a single-stage strategy, focusing either on visual self-attention or visual-textual cross-attention. However, such localized perspectives often overlook the broader information flow across the model, leading to substantial performance degradation, especially under high pruning ratios. In this work, we propose STAR (Stage-wise Attention-guided token Reduction), a training-free, plug-and-play framework that approaches token pruning from a global perspective. Instead of pruning at a single point, STAR performs attention-guided reduction in two complementary stages: an early-stage pruning based on visual self-attention to remove redundant low-level features, and a later-stage pruning guided by cross-modal attention to discard task-irrelevant tokens. This holistic approach allows STAR to significantly reduce computational cost while better preserving task-critical information. Extensive experiments across multiple LVLM architectures and benchmarks show that STAR achieves strong acceleration while maintaining comparable, and in some cases even improved performance.
Problem

Research questions and friction points this paper is trying to address.

Reduces computational overhead in large vision-language models
Improves token pruning via stage-wise attention guidance
Maintains performance while accelerating multimodal inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stage-wise attention-guided token reduction framework
Early-stage pruning via visual self-attention
Later-stage pruning via cross-modal attention
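The later-stage step in the list above scores the surviving visual tokens by how much attention the text tokens pay to them, discarding task-irrelevant ones. A minimal sketch, assuming scaled dot-product cross-attention; the function name and the mean-over-text-queries relevance score are illustrative assumptions, not the paper's exact criterion.

```python
import numpy as np

def prune_by_cross_attention(vis_tokens, text_queries, keep_ratio=0.5):
    """Stage-2 sketch (hypothetical helper): keep the visual tokens
    most attended to by the text tokens.

    vis_tokens   : (N, D) remaining visual token embeddings
    text_queries : (T, D) text token query embeddings
    """
    d = vis_tokens.shape[1]
    # (T, N) text->vision attention via scaled dot-product + softmax.
    scores = text_queries @ vis_tokens.T / np.sqrt(d)
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    # Relevance of each visual token = mean attention from all text tokens.
    importance = attn.mean(axis=0)                 # (N,)
    k = max(1, int(round(keep_ratio * len(vis_tokens))))
    keep = np.sort(np.argsort(importance)[-k:])
    return vis_tokens[keep], keep

# Toy example: 8 visual tokens, 3 text tokens, keep a quarter.
rng = np.random.default_rng(1)
vis = rng.normal(size=(8, 4))
txt = rng.normal(size=(3, 4))
kept_vis, idx = prune_by_cross_attention(vis, txt, keep_ratio=0.25)
print(kept_vis.shape)
```

Chaining the two stages (self-attention pruning early, cross-modal pruning later) is what gives the framework its global, stage-wise character, as opposed to single-point pruning.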