Balancing Saliency and Coverage: Semantic Prominence-Aware Budgeting for Visual Token Compression in VLMs

📅 2026-03-16
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the computational bottleneck in vision-language models caused by the explosion in visual token count when processing high-resolution inputs. The authors propose PromPrune, a framework that introduces a sample-adaptive, semantic prominence-aware mechanism to dynamically allocate token budgets between locally salient regions and globally diverse areas. PromPrune couples this budget allocation with a two-stage token selection pipeline that integrates saliency analysis with dynamic budget allocation. Evaluated on LLaVA-NeXT-7B, the method reduces FLOPs by 88% and prefill latency by 22% while retaining 97.5% of the original model's accuracy, outperforming conventional static compression strategies.

πŸ“ Abstract
Large Vision-Language Models (VLMs) achieve strong multimodal understanding capabilities by leveraging high-resolution visual inputs, but the resulting large number of visual tokens creates a major computational bottleneck. Recent work mitigates this issue through visual token compression, typically compressing tokens based on saliency, diversity, or a fixed combination of both. We observe that the distribution of semantic prominence varies substantially across samples, leading to different optimal trade-offs between local saliency preservation and global coverage. This observation suggests that applying a static compression strategy across all samples can be suboptimal. Motivated by this insight, we propose PromPrune, a sample-adaptive visual token selection framework composed of semantic prominence-aware budget allocation and a two-stage selection pipeline. Our method adaptively balances local saliency preservation and global coverage according to the semantic prominence distribution of each sample. By allocating token budgets between locally salient regions and globally diverse regions, our method maintains strong performance even under high compression ratios. On LLaVA-NeXT-7B, our approach reduces FLOPs by 88% and prefill latency by 22% while preserving 97.5% of the original accuracy.
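The paper does not publish its selection algorithm here, but the idea described in the abstract can be sketched in plain Python. This is a hypothetical illustration, not the authors' implementation: per-token "semantic prominence" is approximated by a saliency score, its concentration is measured with normalized entropy, and the token budget is split accordingly between a top-saliency stage and a coverage stage (greedy farthest-point sampling). The function names, the entropy heuristic, and the distance-based coverage pass are all assumptions made for the sketch.

```python
import math

def prominence_budget_split(saliency, budget):
    """Split a token budget between saliency and coverage (illustrative).

    Low entropy of the saliency distribution means prominence is
    concentrated in a few tokens, so more budget goes to salient tokens;
    high entropy means prominence is spread out, so more goes to coverage.
    """
    total = sum(saliency)
    probs = [s / total for s in saliency]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    max_entropy = math.log(len(saliency))
    spread = entropy / max_entropy if max_entropy > 0 else 0.0
    k_salient = round(budget * (1.0 - spread))  # concentrated -> more salient picks
    k_salient = max(1, min(budget - 1, k_salient))  # keep both stages non-empty
    return k_salient, budget - k_salient

def select_tokens(features, saliency, budget):
    """Two-stage selection: top-saliency tokens first, then greedy
    farthest-point sampling over the rest for global coverage."""
    k_sal, k_cov = prominence_budget_split(saliency, budget)
    order = sorted(range(len(saliency)), key=lambda i: -saliency[i])
    chosen = order[:k_sal]
    remaining = order[k_sal:]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    for _ in range(k_cov):
        # pick the remaining token farthest from everything already chosen
        best = max(remaining,
                   key=lambda i: min(sq_dist(features[i], features[j])
                                     for j in chosen))
        chosen.append(best)
        remaining.remove(best)
    return sorted(chosen)
```

For example, a sample whose saliency mass sits on one token would receive a salient-heavy split, while a near-uniform saliency map would shift most of the budget to the coverage stage, mirroring the per-sample trade-off the abstract describes.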
Problem

Research questions and friction points this paper is trying to address.

visual token compression
semantic prominence
saliency
coverage
Vision-Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual token compression
semantic prominence
sample-adaptive budgeting
saliency-coverage trade-off
vision-language models