🤖 AI Summary
This work addresses the inefficiency of multimodal large language models (MLLMs) when processing high-resolution images and videos, where the surge in visual tokens severely slows inference. Existing pruning methods typically operate after visual encoding and thus fail to reduce the computational overhead of the encoding phase itself. To overcome this limitation, the authors propose EvoPrune, the first approach to integrate layer-wise early token pruning directly into the visual encoding process. EvoPrune dynamically retains the most informative tokens by jointly considering token similarity, diversity, and attention weights. By moving pruning into the encoder, EvoPrune sidesteps the constraints of post-encoding strategies, achieving roughly 2× inference acceleration on image and video benchmarks such as VideoMME with less than 1% performance degradation, thereby substantially improving the deployment efficiency of MLLMs.
📝 Abstract
Multimodal Large Language Models (MLLMs) have shown strong performance in vision-language tasks, but their inference efficiency is severely limited by the rapid growth of visual tokens in complex scenarios such as high-resolution images and videos. Existing visual token pruning methods mainly operate after visual encoding, overlooking the substantial computational cost incurred during the encoding stage. To address this issue, we propose EvoPrune, an early-stage visual token pruning method for MLLMs that performs pruning directly during visual encoding. Specifically, EvoPrune employs a layer-wise pruning strategy guided by token similarity, diversity, and attention-based importance to retain the most informative visual tokens at selected encoding layers. Extensive experiments on image and video benchmarks validate the effectiveness of EvoPrune. In particular, on the VideoMME dataset, EvoPrune achieves 2$\times$ inference speedup with less than 1% performance degradation, demonstrating its potential for latency-sensitive MLLM deployment.
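To make the scoring idea concrete, here is a minimal sketch of layer-wise token selection that combines attention-based importance with a diversity term derived from pairwise token similarity. This is an illustration under stated assumptions, not the paper's actual implementation: the function `prune_tokens`, the weighting parameter `alpha`, the use of received attention mass as the importance signal, and the `keep_ratio` schedule are all hypothetical choices made for clarity.

```python
import numpy as np

def prune_tokens(tokens, attn_weights, keep_ratio=0.5, alpha=0.5):
    """Hypothetical sketch of similarity/diversity/attention-guided pruning
    at one encoder layer (not the authors' implementation).

    tokens: (N, D) visual token embeddings at this layer.
    attn_weights: (N,) attention mass each token receives (e.g. averaged
        over heads and query positions).
    keep_ratio: fraction of tokens to retain at this layer.
    alpha: assumed mixing weight between importance and diversity.
    """
    n = len(tokens)

    # Importance: normalized attention received by each token.
    importance = attn_weights / (attn_weights.sum() + 1e-8)

    # Redundancy: mean cosine similarity to the other tokens; tokens that
    # are highly similar to the rest carry less unique information.
    normed = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    sim = normed @ normed.T
    redundancy = (sim.sum(axis=1) - 1.0) / max(n - 1, 1)
    diversity = 1.0 - redundancy

    # Combined score: weighted mix of importance and diversity.
    score = alpha * importance + (1.0 - alpha) * diversity

    # Keep the top-k tokens, preserving their original spatial order.
    k = max(1, int(n * keep_ratio))
    keep = np.sort(np.argsort(score)[::-1][:k])
    return tokens[keep], keep
```

Applied at a few selected encoder layers with progressively smaller `keep_ratio`, a scheme of this shape shrinks the token sequence before it ever reaches the language model, which is where the encoding-stage savings described above would come from.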