Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding

📅 2026-04-14
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Existing visual token pruning methods generalize poorly to complex multimodal reasoning tasks, often with significant performance degradation. This work identifies Relevant Visual Information Shift (RVIS), in which the relevance of visual tokens changes across decoding steps, as the primary cause. To address it, the work proposes Decoding-stage Shift-aware Token Pruning (DSTP), a training-free add-on framework that lets existing pruning methods realign the retained visual tokens with the model's shifting reasoning requirements during decoding. DSTP plugs into existing multimodal large language models, substantially reduces the performance loss of pruning on complex visual reasoning tasks, and also yields consistent gains on standard visual understanding benchmarks, all with minimal computational overhead.

๐Ÿ“ Abstract
Recently, visual token pruning has been studied to handle the vast number of visual tokens in Multimodal Large Language Models. However, we observe that while existing pruning methods perform reliably on simple visual understanding, they struggle to effectively generalize to complex visual reasoning tasks, a critical gap underexplored in previous studies. Through a systematic analysis, we identify Relevant Visual Information Shift (RVIS) during decoding as the primary failure driver. To address this, we propose Decoding-stage Shift-aware Token Pruning (DSTP), a training-free add-on framework that enables existing pruning methods to align visual tokens with shifting reasoning requirements during the decoding stage. Extensive experiments demonstrate that DSTP significantly mitigates performance degradation of pruning methods in complex reasoning tasks, while consistently yielding performance gains even across visual understanding benchmarks. Furthermore, DSTP demonstrates effectiveness across diverse state-of-the-art architectures, highlighting its generalizability and efficiency with minimal computational overhead.
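To make the idea of RVIS concrete, the following is a minimal toy sketch, not the paper's DSTP method. It assumes each decoding step yields a per-token relevance score for the visual tokens (for example, derived from attention) and measures how much a token set kept once at step 0 overlaps with the top-k set at later steps. The function names, the Jaccard overlap measure, and the scores are all hypothetical and chosen only for illustration.

```python
from typing import List, Set


def top_k_indices(scores: List[float], k: int) -> Set[int]:
    """Return the indices of the k highest-scoring visual tokens."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Overlap between two token sets (1.0 = identical, 0.0 = disjoint)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def shift_profile(step_scores: List[List[float]], k: int) -> List[float]:
    """Overlap between the set kept at step 0 and the top-k set at every step."""
    kept_once = top_k_indices(step_scores[0], k)
    return [jaccard(kept_once, top_k_indices(s, k)) for s in step_scores]


# Made-up relevance scores for 6 visual tokens over 3 decoding steps.
toy_scores = [
    [0.9, 0.8, 0.1, 0.1, 0.0, 0.0],  # step 0: tokens 0 and 1 are most relevant
    [0.7, 0.1, 0.9, 0.1, 0.0, 0.0],  # step 1: token 2 displaces token 1
    [0.0, 0.1, 0.2, 0.1, 0.9, 0.8],  # step 2: tokens 4 and 5 are most relevant
]
profile = shift_profile(toy_scores, k=2)
print(profile)  # overlap drops from 1.0 toward 0.0 as relevance shifts
```

In this toy setting, a pruning decision made once before decoding keeps tokens 0 and 1, yet by the last step none of the retained tokens is among the most relevant ones. That is the kind of mismatch a decoding-stage, shift-aware scheme would need to detect and correct.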
Problem

Research questions and friction points this paper is trying to address.

visual token pruning
multimodal large language models
complex visual reasoning
relevant visual information shift
decoding stage
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual token pruning
relevant visual information shift
decoding-stage adaptation
multimodal large language models
training-free framework