Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding

📅 2026-04-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing visual token pruning methods generalize poorly to complex multimodal reasoning tasks, where they often cause significant performance degradation. This work identifies Relevant Visual Information Shift (RVIS), in which the visual tokens relevant to the model change across decoding steps, as the primary cause of that failure. To address it, the authors propose Decoding-stage Shift-aware Token Pruning (DSTP), a training-free add-on framework that lets existing pruning methods align the retained visual tokens with the shifting reasoning requirements during decoding. DSTP substantially mitigates the performance loss of pruning methods on complex visual reasoning tasks, consistently improves results on visual understanding benchmarks, and works across diverse state-of-the-art MLLM architectures with minimal computational overhead.

📝 Abstract

Recently, visual token pruning has been studied to handle the vast number of visual tokens in Multimodal Large Language Models. However, we observe that while existing pruning methods perform reliably on simple visual understanding, they struggle to effectively generalize to complex visual reasoning tasks, a critical gap underexplored in previous studies. Through a systematic analysis, we identify Relevant Visual Information Shift (RVIS) during decoding as the primary failure driver. To address this, we propose Decoding-stage Shift-aware Token Pruning (DSTP), a training-free add-on framework that enables existing pruning methods to align visual tokens with shifting reasoning requirements during the decoding stage. Extensive experiments demonstrate that DSTP significantly mitigates performance degradation of pruning methods in complex reasoning tasks, while consistently yielding performance gains even across visual understanding benchmarks. Furthermore, DSTP demonstrates effectiveness across diverse state-of-the-art architectures, highlighting its generalizability and efficiency with minimal computational overhead.
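
The abstract does not spell out how the shift is detected or how DSTP reacts to it, so the following is only an illustrative sketch of the general idea, not the authors' algorithm. It assumes a per-step relevance score for every visual token is available (for example, attention from the current text token), and it uses a hypothetical rule: refresh the retained token set whenever its Jaccard overlap with the current top-k falls below a threshold. The function names, the overlap measure, and the threshold are all assumptions made for illustration.

```python
from typing import List, Set


def top_k_indices(scores: List[float], k: int) -> Set[int]:
    """Return the indices of the k highest-scoring visual tokens."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Overlap between two token sets (1.0 = identical, 0.0 = disjoint)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def shift_aware_selection(
    step_scores: List[List[float]], k: int, threshold: float
) -> List[Set[int]]:
    """Toy decoding loop (hypothetical rule, not the paper's method).

    step_scores[t][i] is the assumed relevance of visual token i at step t.
    The retained set starts as the top-k of step 0 and is replaced by the
    current top-k whenever the overlap between the two drops below threshold.
    Returns the retained set used at each step.
    """
    kept = top_k_indices(step_scores[0], k)
    history = [kept]
    for scores in step_scores[1:]:
        current = top_k_indices(scores, k)
        if jaccard(kept, current) < threshold:
            kept = current  # relevant information has shifted: re-select
        history.append(kept)
    return history


if __name__ == "__main__":
    # Four visual tokens, three decoding steps; relevance moves from
    # tokens 0 and 1 to tokens 2 and 3 at the last step.
    scores = [
        [0.9, 0.8, 0.1, 0.0],
        [0.8, 0.9, 0.1, 0.0],
        [0.0, 0.1, 0.9, 0.8],
    ]
    for step, kept in enumerate(shift_aware_selection(scores, k=2, threshold=0.5)):
        print(step, sorted(kept))
```

A pruning method that fixes its token set once, before decoding, would keep tokens 0 and 1 for all three steps in this toy input. The refresh rule switches to tokens 2 and 3 at the last step, which is the kind of mismatch the RVIS analysis points to.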

Problem

Research questions and friction points this paper is trying to address.

visual token pruning

multimodal large language models

complex visual reasoning

relevant visual information shift

decoding stage

Innovation

Methods, ideas, or system contributions that make the work stand out.

visual token pruning

relevant visual information shift

decoding-stage adaptation