🤖 AI Summary
To address visual token redundancy in large vision-language models (VLMs)—which incurs quadratic attention overhead and degrades cross-modal semantic alignment—this paper proposes a text-guided dynamic visual token selection framework. Our method replaces unstable attention weights with an explicit cross-modal similarity metric, introduces a log-domain weighted fusion mechanism with temperature-based sharpening to enhance discriminability of salient tokens, and designs a diversity-aware background token retention strategy to preserve both semantic completeness and visual significance. Evaluated on LLaVA-1.5, our approach achieves up to 94.4% visual token compression while retaining 92.8% accuracy; remarkably, under 77.8% token pruning, accuracy even improves. It establishes new state-of-the-art trade-offs between efficiency and accuracy across 14 image and video benchmarks.
📝 Abstract
Large vision-language models (VLMs) typically process hundreds or thousands of visual tokens per image or video frame, incurring quadratic attention cost and substantial redundancy. Existing token reduction methods often ignore the textual query or rely on deep attention maps, whose instability under aggressive pruning leads to degraded semantic alignment.
We propose FlashVLM, a text-guided visual token selection framework that dynamically adapts visual inputs to the query. Instead of relying on noisy attention weights, FlashVLM computes an explicit cross-modal similarity between projected image tokens and normalized text embeddings in the language-model space. This extrinsic relevance is fused with intrinsic visual saliency via log-domain weighting and temperature-controlled sharpening. In addition, a diversity-preserving partition retains a minimal yet representative set of background tokens to maintain global context.
Under identical token budgets and evaluation protocols, FlashVLM achieves beyond-lossless compression: it slightly surpasses the unpruned baseline while pruning up to 77.8% of visual tokens on LLaVA-1.5, and it maintains 92.8% accuracy even under 94.4% compression. Extensive experiments on 14 image and video benchmarks demonstrate that FlashVLM delivers state-of-the-art efficiency-performance trade-offs while maintaining strong robustness and generalization across mainstream VLMs.
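The abstract's three ingredients (explicit cross-modal relevance, log-domain fusion with temperature sharpening, and diversity-aware background retention) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; all function names, the fusion weight `alpha`, the temperature `tau`, and the greedy farthest-point background picker are illustrative assumptions.

```python
import numpy as np

def select_tokens(vis, txt, saliency, keep=16, alpha=0.5, tau=0.1, n_bg=4):
    """Hypothetical sketch of text-guided visual token selection.

    vis:      (N, d) projected visual tokens in the language-model space
    txt:      (T, d) text embeddings
    saliency: (N,)   positive intrinsic visual saliency scores
    Returns indices of the kept tokens (keep relevant + n_bg background).
    """
    # Explicit cross-modal relevance: for each visual token, the maximum
    # cosine similarity to any normalized text embedding.
    v = vis / np.linalg.norm(vis, axis=1, keepdims=True)
    t = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    rel = (v @ t.T).max(axis=1)                       # (N,)

    # Temperature-controlled sharpening: softmax with small tau makes the
    # distribution over tokens peakier, emphasizing salient matches.
    e = np.exp(rel / tau)
    rel = e / e.sum()
    sal = saliency / saliency.sum()

    # Log-domain weighted fusion of extrinsic relevance and intrinsic saliency.
    score = alpha * np.log(rel + 1e-12) + (1 - alpha) * np.log(sal + 1e-12)
    top = np.argsort(score)[::-1][:keep]              # most relevant tokens

    # Diversity-aware background retention: a greedy farthest-point pick
    # among the pruned tokens keeps a small, spread-out set for global context.
    rest = np.setdiff1d(np.arange(len(vis)), top)
    bg = [int(rest[0])]
    for _ in range(n_bg - 1):
        # Dissimilarity of each remaining token to the closest kept background.
        d = 1.0 - (v[rest] @ v[bg].T).max(axis=1)
        bg.append(int(rest[np.argmax(d)]))
    return np.concatenate([top, np.array(bg)])
```

With N = 576 patch tokens and keep + n_bg = 20, this would correspond to roughly 96.5% compression, in the same regime as the 94.4% figure reported above.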