Silent Hazards of Token Reduction in Vision-Language Models: The Hidden Impact on Consistency

📅 2025-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Token compression in vision-language models (VLMs) reduces computational cost but often shifts the output distribution and introduces prediction inconsistency, risks that conventional accuracy metrics capture poorly. Method: We first identify a strong correlation between pruning-induced changes in the inverse participation ratio (IPR) of the singular value spectrum and degradation in model consistency, revealing energy redistribution in low-dimensional representations as the underlying mechanism. Building on this insight, we propose LoFi, a training-free token pruning method that integrates SVD-based low-rank approximation, leverage score analysis, and IPR-driven dynamic pruning. Contribution/Results: Experiments show that LoFi achieves near-lossless accuracy (within ±0.3%) while substantially improving output consistency (+12.7% on average), outperforming state-of-the-art methods. LoFi provides an interpretable, lightweight, plug-and-play solution for deploying high-reliability VLMs.
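The IPR of the singular value spectrum mentioned above can be sketched in a few lines. This is a minimal illustration assuming the common definition in which the squared singular values are normalised into a probability distribution; the paper's exact normalisation, and the feature matrix it is applied to, are not specified here, so the function and its example input are hypothetical.

```python
import numpy as np

def ipr_of_spectrum(X):
    """Inverse participation ratio (IPR) of the singular value spectrum of X.

    Squared singular values are treated as a probability distribution over
    singular directions (an assumed normalisation). An IPR near 1/rank means
    spectral energy is spread evenly; an IPR near 1 means it is concentrated
    in a few directions.
    """
    s = np.linalg.svd(X, compute_uv=False)
    p = s**2 / np.sum(s**2)      # energy fraction carried by each direction
    return float(np.sum(p**2))   # IPR lies in (1/rank, 1]

# Hypothetical token feature matrix: 64 visual tokens, 128-dim features
rng = np.random.default_rng(0)
tokens = rng.standard_normal((64, 128))
print(ipr_of_spectrum(tokens))
```

Comparing this quantity before and after pruning is one way to quantify how much spectral energy has been redistributed.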

📝 Abstract
Vision-language models (VLMs) have excelled at visual reasoning but often incur high computational costs, in large part because of redundancy among visual tokens. Although recent token reduction methods claim minimal performance loss, our extensive experiments reveal that token reduction can substantially alter a model's output distribution, changing prediction patterns in ways that standard metrics such as accuracy loss do not fully capture. Such inconsistencies are especially concerning for practical applications where system stability is critical. To investigate this phenomenon, we analyze how token reduction influences the energy distribution of a VLM's internal representations using a low-rank approximation via Singular Value Decomposition (SVD). Our results show that changes in the Inverse Participation Ratio (IPR) of the singular value spectrum are strongly correlated with the model's consistency after token reduction. Based on these insights, we propose LoFi, a training-free visual token reduction method that uses the leverage scores from SVD for token pruning. Experimental evaluations demonstrate that LoFi not only reduces computational costs with minimal performance degradation but also significantly outperforms state-of-the-art methods in output consistency.
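The leverage-score pruning the abstract describes can be sketched generically as follows. This is not the paper's exact LoFi procedure: the rank `k`, the number of tokens kept, and the matrix being decomposed are illustrative assumptions. The sketch uses the standard definition of rank-k leverage scores, the squared row norms of the truncated left singular matrix.

```python
import numpy as np

def prune_tokens_by_leverage(X, k, keep):
    """Keep the `keep` tokens with the highest rank-k SVD leverage scores.

    For X ≈ U_k S_k V_k^T, the leverage score of token i is the squared
    norm of row i of U_k: a measure of how much that token contributes to
    the dominant k-dimensional subspace. Generic sketch, not LoFi itself.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    scores = np.sum(U[:, :k] ** 2, axis=1)            # per-token leverage
    kept = np.sort(np.argsort(scores)[::-1][:keep])   # top-`keep`, original order
    return X[kept], kept

# Hypothetical token feature matrix: 64 visual tokens, 128-dim features
rng = np.random.default_rng(1)
X = rng.standard_normal((64, 128))
pruned, idx = prune_tokens_by_leverage(X, k=8, keep=16)
print(pruned.shape)  # (16, 128)
```

Sorting the kept indices preserves the original token order, which matters when the pruned tokens are fed back into a sequence model.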
Problem

Research questions and friction points this paper is trying to address.

Token reduction in VLMs can substantially alter a model's output distribution.
Standard accuracy metrics fail to capture the resulting prediction inconsistencies.
The proposed LoFi method improves output consistency while reducing computational cost.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzes the energy distribution of internal representations via Singular Value Decomposition (SVD)
Proposes LoFi, a training-free, leverage-score-based token reduction method
Targets output consistency, not just accuracy, in VLMs