🤖 AI Summary
To address visual token redundancy in large vision-language models (VLMs)—which incurs quadratic attention overhead and degrades cross-modal semantic alignment—this paper proposes a text-guided dynamic visual token selection framework. Our method replaces unstable attention weights with an explicit cross-modal similarity metric, introduces a log-domain weighted fusion mechanism with temperature-based sharpening to enhance discriminability of salient tokens, and designs a diversity-aware background token retention strategy to preserve both semantic completeness and visual significance. Evaluated on LLaVA-1.5, our approach achieves up to 94.4% visual token compression while retaining 92.8% accuracy; remarkably, under 77.8% token pruning, accuracy even improves. It establishes new state-of-the-art trade-offs between efficiency and accuracy across 14 image and video benchmarks.
📝 Abstract
Large vision-language models (VLMs) typically process hundreds or thousands of visual tokens per image or video frame, incurring quadratic attention cost and substantial redundancy. Existing token reduction methods often ignore the textual query or rely on deep attention maps, whose instability under aggressive pruning leads to degraded semantic alignment.
We propose FlashVLM, a text-guided visual token selection framework that dynamically adapts visual inputs to the query. Instead of relying on noisy attention weights, FlashVLM computes an explicit cross-modal similarity between projected image tokens and normalized text embeddings in the language-model space. This extrinsic relevance is fused with intrinsic visual saliency via log-domain weighting and temperature-controlled sharpening. In addition, a diversity-preserving partition retains a minimal yet representative set of background tokens to maintain global context.
Under identical token budgets and evaluation protocols, FlashVLM achieves beyond-lossless compression: it slightly surpasses the unpruned baseline while pruning up to 77.8% of visual tokens on LLaVA-1.5, and it maintains 92.8% accuracy even under 94.4% compression. Extensive experiments on 14 image and video benchmarks demonstrate that FlashVLM delivers state-of-the-art efficiency-performance trade-offs while maintaining strong robustness and generalization across mainstream VLMs.
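The abstract's three ingredients (explicit cross-modal relevance, log-domain fusion with temperature sharpening, and diversity-aware background retention) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; all function names, the fusion weight `alpha`, the temperature `tau`, and the greedy farthest-point background picker are illustrative assumptions.

```python
import numpy as np

def select_tokens(vis, txt, saliency, keep=16, alpha=0.5, tau=0.1, n_bg=4):
    """Hypothetical sketch of text-guided visual token selection.

    vis:      (N, d) projected visual tokens in the language-model space
    txt:      (T, d) text embeddings
    saliency: (N,)   positive intrinsic visual saliency scores
    Returns indices of the kept tokens (keep relevant + n_bg background).
    """
    # Explicit cross-modal relevance: for each visual token, the maximum
    # cosine similarity to any normalized text embedding.
    v = vis / np.linalg.norm(vis, axis=1, keepdims=True)
    t = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    rel = (v @ t.T).max(axis=1)                       # (N,)

    # Temperature-controlled sharpening: softmax with small tau makes the
    # distribution over tokens peakier, emphasizing salient matches.
    e = np.exp(rel / tau)
    rel = e / e.sum()
    sal = saliency / saliency.sum()

    # Log-domain weighted fusion of extrinsic relevance and intrinsic saliency.
    score = alpha * np.log(rel + 1e-12) + (1 - alpha) * np.log(sal + 1e-12)
    top = np.argsort(score)[::-1][:keep]              # most relevant tokens

    # Diversity-aware background retention: a greedy farthest-point pick
    # among the pruned tokens keeps a small, spread-out set for global context.
    rest = np.setdiff1d(np.arange(len(vis)), top)
    bg = [int(rest[0])]
    for _ in range(n_bg - 1):
        # Dissimilarity of each remaining token to the closest kept background.
        d = 1.0 - (v[rest] @ v[bg].T).max(axis=1)
        bg.append(int(rest[np.argmax(d)]))
    return np.concatenate([top, np.array(bg)])
```

With N = 576 patch tokens and keep + n_bg = 20, this would correspond to roughly 96.5% compression, in the same regime as the 94.4% figure reported above.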