FlashVLM: Text-Guided Visual Token Selection for Large Multimodal Models

📅 2025-12-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address visual token redundancy in large vision-language models (VLMs)—which incurs quadratic attention overhead and degrades cross-modal semantic alignment—this paper proposes a text-guided dynamic visual token selection framework. Our method replaces unstable attention weights with an explicit cross-modal similarity metric, introduces a log-domain weighted fusion mechanism with temperature-based sharpening to enhance discriminability of salient tokens, and designs a diversity-aware background token retention strategy to preserve both semantic completeness and visual significance. Evaluated on LLaVA-1.5, our approach achieves up to 94.4% visual token compression while retaining 92.8% accuracy; remarkably, under 77.8% token pruning, accuracy even improves. It establishes new state-of-the-art trade-offs between efficiency and accuracy across 14 image and video benchmarks.

📝 Abstract
Large vision-language models (VLMs) typically process hundreds or thousands of visual tokens per image or video frame, incurring quadratic attention cost and substantial redundancy. Existing token reduction methods often ignore the textual query or rely on deep attention maps, whose instability under aggressive pruning leads to degraded semantic alignment. We propose FlashVLM, a text-guided visual token selection framework that dynamically adapts visual inputs to the query. Instead of relying on noisy attention weights, FlashVLM computes an explicit cross-modal similarity between projected image tokens and normalized text embeddings in the language model space. This extrinsic relevance is fused with intrinsic visual saliency using log-domain weighting and temperature-controlled sharpening. In addition, a diversity-preserving partition retains a minimal yet representative set of background tokens to maintain global context. Under identical token budgets and evaluation protocols, FlashVLM achieves beyond-lossless compression, slightly surpassing the unpruned baseline while pruning up to 77.8% of visual tokens on LLaVA-1.5, and maintaining 92.8% accuracy even under 94.4% compression. Extensive experiments on 14 image and video benchmarks demonstrate that FlashVLM delivers state-of-the-art efficiency-performance trade-offs while maintaining strong robustness and generalization across mainstream VLMs.
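The selection pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function name, the greedy farthest-point background sampler, and the hyperparameters `alpha` (relevance/saliency weight) and `tau` (sharpening temperature) are all assumptions for the sake of the example.

```python
import numpy as np

def select_tokens(image_tokens, text_emb, saliency, keep=16, bg_keep=4,
                  alpha=0.7, tau=0.1):
    """Hypothetical sketch of text-guided visual token selection.

    image_tokens: (N, d) projected visual tokens in the language-model space
    text_emb:     (d,)   pooled text embedding
    saliency:     (N,)   intrinsic visual saliency scores in [0, 1]
    Returns indices of retained tokens (query-relevant + diverse background).
    """
    # 1) Explicit cross-modal relevance: cosine similarity, mapped to (0, 1],
    #    replacing unstable deep attention weights.
    v = image_tokens / np.linalg.norm(image_tokens, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    relevance = (v @ t + 1.0) / 2.0 + 1e-8

    # 2) Log-domain weighted fusion of extrinsic relevance and intrinsic
    #    saliency, then temperature-controlled sharpening (softmax-style,
    #    shifted by the max for numerical stability).
    fused = alpha * np.log(relevance) + (1.0 - alpha) * np.log(saliency + 1e-8)
    scores = np.exp((fused - fused.max()) / tau)
    scores /= scores.sum()

    # 3) Keep the top-k highest-scoring (query-relevant) tokens.
    order = np.argsort(-scores)
    fg = order[:keep]

    # 4) Diversity-aware background retention: greedy farthest-point sampling
    #    over the discarded tokens so a few representative background tokens
    #    preserve global context.
    rest = list(order[keep:])
    bg = [rest.pop(0)]
    while len(bg) < bg_keep and rest:
        dist = np.min(1.0 - v[rest] @ v[bg].T, axis=1)  # cosine distance to kept set
        bg.append(rest.pop(int(np.argmax(dist))))
    return np.concatenate([fg, np.array(bg)])

# Toy usage: keep 8 relevant + 4 diverse background tokens out of 64.
rng = np.random.default_rng(0)
idx = select_tokens(rng.normal(size=(64, 8)), rng.normal(size=8),
                    rng.random(64), keep=8, bg_keep=4)
```

The log-domain fusion means a token must score reasonably on *both* relevance and saliency to survive, while the temperature `tau` controls how sharply the budget concentrates on the top tokens.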
Problem

Research questions and friction points this paper is trying to address.

Reducing visual token redundancy and quadratic attention cost in multimodal models
Preserving cross-modal semantic alignment through text-guided token selection
Maintaining accuracy under aggressive visual token compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-guided dynamic visual token selection conditioned on the query
Cross-modal similarity fusion with saliency and diversity
Achieves lossless compression with high token pruning rates
Kaitong Cai
Sun Yat-sen University
Jusheng Zhang
Sun Yat-sen University
Jing Yang
Sun Yat-sen University
Yijia Fan
Sun Yat-sen University
Pengtao Xie
Associate Professor, UC San Diego; Adjunct Faculty, MBZUAI
Jian Wang
Snap Inc.
Keze Wang
Sun Yat-sen University