🤖 AI Summary
Vision-language models (VLMs) incur substantial computational overhead during inference due to the redundant visual tokens generated by high-resolution inputs. Existing token pruning methods often neglect the semantic guidance available in the text prompt, compromising both task relevance and visual diversity. To address this, we propose a zero-shot, prompt-aware hierarchical visual token pruning framework. Our approach explicitly incorporates prompt semantics into the pruning process—distinguishing *task-critical tokens* essential for answer generation from *diversity-supplementary tokens* that preserve contextual richness. We model prompt–token interactions via attention and introduce a hierarchical dynamic selection strategy that operates across spatial and semantic levels. Crucially, the method requires no fine-tuning or additional training. Extensive experiments across multiple VLMs (e.g., LLaVA, Qwen-VL) and benchmarks (e.g., MMStar, MME) demonstrate up to 90% visual token reduction with negligible accuracy degradation (<0.5% average drop), alongside significant reductions in GPU memory consumption and inference latency.
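The two-stage selection described above can be sketched in a few lines. The following is a minimal illustrative sketch, not the paper's exact algorithm: it scores visual tokens by averaged prompt-to-visual attention, keeps a top-scoring task-critical set, then supplements it with diversity tokens via greedy farthest-point selection under cosine similarity. The function name, the `keep_ratio`/`task_frac` parameters, and the specific diversity criterion are assumptions made for illustration.

```python
import numpy as np

def prune_visual_tokens(visual_tokens, prompt_tokens, keep_ratio=0.1, task_frac=0.7):
    """Illustrative sketch of prompt-aware hierarchical pruning (assumed API).

    visual_tokens: (N, d) visual token embeddings
    prompt_tokens: (M, d) text prompt token embeddings
    keep_ratio:    fraction of visual tokens to retain (0.1 ~ 90% pruning)
    task_frac:     fraction of the kept budget spent on task-critical tokens
    Returns sorted indices of the retained visual tokens.
    """
    n, d = visual_tokens.shape
    n_keep = max(1, int(round(n * keep_ratio)))
    n_task = min(n_keep, max(1, int(round(n_keep * task_frac))))

    # Stage 1 (task relevance): scaled dot-product attention from prompt
    # tokens to visual tokens, averaged over the prompt.
    logits = prompt_tokens @ visual_tokens.T / np.sqrt(d)        # (M, N)
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    relevance = attn.mean(axis=0)                                # (N,)
    kept = list(np.argsort(relevance)[::-1][:n_task])

    # Stage 2 (diversity supplement): greedily add the remaining token
    # least similar (cosine) to its nearest already-kept token.
    unit = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    remaining = [i for i in range(n) if i not in set(kept)]
    while len(kept) < n_keep and remaining:
        sim_to_kept = unit[remaining] @ unit[kept].T             # (R, K)
        pick = remaining[int(np.argmin(sim_to_kept.max(axis=1)))]
        kept.append(pick)
        remaining.remove(pick)
    return np.sort(np.array(kept))
```

In a real VLM the retained indices would be used to subset the visual token sequence before it enters the language model, which is where the memory and latency savings come from.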
📝 Abstract
As the capabilities of Vision-Language Models (VLMs) advance, they process increasingly large inputs; unlike text tokens in LLMs, the resulting visual tokens are highly redundant, leading to prohibitive inference costs. While many methods aim to reduce these costs by pruning visual tokens, existing approaches, whether attention-based or diversity-based, typically neglect the guidance of the text prompt and thus fail to prioritize task relevance. In this work, we propose a novel zero-shot method that reframes the problem from a prompt-aware perspective, explicitly modeling visual token pruning as a balance between task relevance and information diversity. Our hierarchical approach first selects a core set of task-relevant visual tokens and then supplements it with diversity tokens to preserve broader context. Experiments across multiple models and benchmarks show that our method matches or surpasses the state of the art with only minimal accuracy loss, even when pruning up to 90% of the tokens. These gains are accompanied by significant reductions in GPU memory footprint and inference latency.