AI Summary
Existing visual token pruning methods generalize poorly to complex multimodal reasoning tasks, often with significant performance degradation. This work identifies and defines Relevant Visual Information Shift (RVIS), the phenomenon in which the relevance of visual tokens changes across decoding steps. To address it, the work proposes Decoding-stage Shift-aware Token Pruning (DSTP), a training-free framework that adjusts pruning decisions by re-evaluating the importance of visual tokens during decoding. Designed as a plug-and-play add-on, DSTP integrates into existing multimodal large language models, substantially mitigating performance loss on complex visual reasoning tasks while consistently improving results on standard vision-language understanding benchmarks, all with minimal computational overhead.
Abstract
Visual token pruning has recently been studied as a way to handle the large number of visual tokens in Multimodal Large Language Models. However, we observe that while existing pruning methods perform reliably on simple visual understanding, they struggle to generalize to complex visual reasoning tasks, a critical gap underexplored in previous studies. Through a systematic analysis, we identify Relevant Visual Information Shift (RVIS) during decoding as the primary cause of failure. To address this, we propose Decoding-stage Shift-aware Token Pruning (DSTP), a training-free add-on framework that enables existing pruning methods to align visual tokens with shifting reasoning requirements during the decoding stage. Extensive experiments demonstrate that DSTP significantly mitigates the performance degradation of pruning methods on complex reasoning tasks, while consistently yielding performance gains on visual understanding benchmarks as well. Furthermore, DSTP is effective across diverse state-of-the-art architectures, highlighting its generalizability and its efficiency, with minimal computational overhead.
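
To make the idea of a relevance shift concrete, the following toy sketch compares pruning once before decoding with re-selecting tokens at every decoding step. It is not the DSTP algorithm, whose details are not given here. The relevance scores are made-up numbers, and the helper names `top_k` and `retained_relevance` are hypothetical. The sketch also assumes that all visual tokens remain available for re-selection, which a real pruning pipeline would have to arrange.

```python
def top_k(scores, k):
    """Return the indices of the k highest-scoring visual tokens as a set."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def retained_relevance(step_scores, k, reselect):
    """Fraction of total relevance covered by the kept tokens at each step.

    reselect=False: prune once, using the scores of the first step only.
    reselect=True:  re-pick the top-k tokens at every decoding step.
    """
    kept = top_k(step_scores[0], k)
    fractions = []
    for scores in step_scores:
        if reselect:
            kept = top_k(scores, k)
        fractions.append(sum(scores[i] for i in kept) / sum(scores))
    return fractions


# Hypothetical relevance of 6 visual tokens over 3 decoding steps.
# Relevance drifts from tokens 0-1 toward tokens 4-5 as decoding proceeds.
step_scores = [
    [0.40, 0.30, 0.10, 0.10, 0.05, 0.05],
    [0.10, 0.20, 0.30, 0.30, 0.05, 0.05],
    [0.05, 0.05, 0.10, 0.10, 0.30, 0.40],
]

static = retained_relevance(step_scores, k=2, reselect=False)
adaptive = retained_relevance(step_scores, k=2, reselect=True)

for step, (s, a) in enumerate(zip(static, adaptive)):
    print(f"step {step}: static={s:.2f} adaptive={a:.2f}")
```

In this toy setting the one-shot selection keeps tokens 0 and 1, which cover less and less of the relevance as it moves to other tokens, while per-step re-selection follows the shift. How relevance is actually scored, and how often tokens are re-selected, are left open by the abstract.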