🤖 AI Summary
To address the high computational cost and deployment challenges of multimodal large language models (MLLMs), this paper proposes Visual Tokens Withdrawal (VTW), a training-free, plug-and-play inference acceleration mechanism. Motivated by two newly identified phenomena, the attention sink and visual information migration, the authors hypothesize that vision tokens are redundant in the deep layers of MLLMs. VTW selects a withdrawal layer via attention analysis and a KL-divergence-based criterion, then prunes all vision tokens at that layer and redirects the inference path so that only text tokens propagate through the remaining layers. Crucially, VTW requires no fine-tuning or architectural modification. Evaluated across multiple benchmarks, it achieves over a 40% average speedup with less than 0.5% performance degradation, significantly enhancing MLLM inference efficiency and practical deployability.
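The withdrawal step itself is simple to picture. Below is a minimal sketch, assuming a toy decoder stack in which each transformer block is stood in for by a plain weight matrix; `forward_with_vtw`, `vision_mask`, and all shapes are illustrative conventions, not the paper's actual implementation:

```python
import numpy as np

def forward_with_vtw(layer_weights, hidden, vision_mask, withdraw_layer):
    """Toy decoder stack illustrating Visual Tokens Withdrawal (VTW).

    hidden: (seq_len, dim) activations; vision_mask: boolean array marking
    vision-token positions; each entry of layer_weights is a (dim, dim)
    matrix standing in for a full attention + MLP block.
    """
    for i, w in enumerate(layer_weights):
        if i == withdraw_layer:
            # Withdraw vision tokens: only text tokens reach the deep layers.
            hidden = hidden[~vision_mask]
            vision_mask = vision_mask[~vision_mask]  # now all False
        hidden = hidden @ w  # placeholder for a transformer block
    return hidden

rng = np.random.default_rng(0)
dim = 8
layers = [rng.standard_normal((dim, dim)) for _ in range(4)]
hidden = rng.standard_normal((10, dim))             # 6 vision + 4 text tokens
vision_mask = np.array([True] * 6 + [False] * 4)
out = forward_with_vtw(layers, hidden, vision_mask, withdraw_layer=2)
# out has shape (4, dim): only the 4 text tokens survive past layer 2
```

Because every layer after the withdrawal point processes a much shorter sequence, the attention and MLP cost of those layers shrinks accordingly, which is where the reported speedup comes from.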
📝 Abstract
Multimodal large language models (MLLMs) demand considerable computation for inference due to their extensive parameters and the additional input tokens needed to represent visual information. Herein, we introduce Visual Tokens Withdrawal (VTW), a plug-and-play module that accelerates MLLM inference. Our approach is inspired by two intriguing phenomena we have observed: (1) the attention sink phenomenon that is prevalent in LLMs also persists in MLLMs, indicating that the initial tokens and the most recent tokens receive the majority of attention, while the vision tokens in between garner minimal attention in deep layers; (2) the presence of information migration, whereby visual information is transferred to subsequent text tokens within the first few layers of MLLMs. Based on these findings, we conclude that vision tokens are unnecessary in the deep layers of MLLMs. Thus, we strategically withdraw them at a certain layer, enabling only text tokens to engage in subsequent layers. To pinpoint the ideal layer for VTW, we analyze a small calibration subset of data and choose the first layer that meets a Kullback-Leibler divergence criterion. Our VTW approach can cut computational overhead by over 40% across diverse multimodal tasks while maintaining performance.
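The layer-selection idea can be sketched as follows: for each candidate layer, compare the model's next-token distribution with vision tokens withdrawn at that layer against the full model's distribution, and pick the first layer whose divergence falls below a threshold. The helper names, the threshold value, and the way per-layer distributions are obtained are all assumptions for illustration, not the paper's exact procedure:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete probability distributions."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def select_withdrawal_layer(full_probs, withdrawn_probs_per_layer, threshold):
    """Return the first layer whose output distribution, with vision tokens
    withdrawn at that layer, stays within `threshold` KL divergence of the
    full model's output; fall back to never withdrawing."""
    for k, q in enumerate(withdrawn_probs_per_layer):
        if kl_divergence(full_probs, q) < threshold:
            return k
    return len(withdrawn_probs_per_layer)

# Dummy next-token distributions for two candidate layers: withdrawing too
# early distorts the output badly; withdrawing later barely changes it.
full_probs = np.array([0.7, 0.2, 0.1])
per_layer = [np.array([0.1, 0.6, 0.3]),    # early withdrawal: large KL
             np.array([0.68, 0.22, 0.10])]  # later withdrawal: tiny KL
layer = select_withdrawal_layer(full_probs, per_layer, threshold=0.01)
# layer == 1: the first candidate meeting the criterion
```

Running this selection once on a small calibration set keeps the search cost negligible relative to the inference savings it unlocks.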