🤖 AI Summary
Multimodal large language models (MLLMs) incur high inference overhead because attention computation grows quadratically with the number of multimodal tokens. Method: This work identifies a three-stage, layer-wise cross-modal interaction pattern in MLLMs: shallow layers recognize task intent, middle layers fuse the modalities abruptly through a few critical visual tokens, and deep layers discard visual tokens to focus on linguistic refinement. Based on this insight, the authors propose VisiPruner, a training-free, dynamic token pruning framework that removes visual tokens layer by layer according to the vision-language information flow and visual token importance. Contribution/Results: VisiPruner reduces up to 99% of vision-related attention computation and 53.9% of FLOPs on LLaVA-v1.5-7B, outperforming existing token pruning methods, and it generalizes across diverse MLLMs. The findings also offer guidelines for training efficient MLLMs by aligning model architecture with layer-wise processing dynamics.
📝 Abstract
Multimodal Large Language Models (MLLMs) have achieved strong performance across vision-language tasks, but suffer from significant computational overhead due to the quadratic growth of attention computations with the number of multimodal tokens. Though efforts have been made to prune tokens in MLLMs, they lack a fundamental understanding of how MLLMs process and fuse multimodal information. Through systematic analysis, we uncover a three-stage cross-modal interaction process: (1) Shallow layers recognize task intent, with visual tokens acting as passive attention sinks; (2) Cross-modal fusion occurs abruptly in middle layers, driven by a few critical visual tokens; (3) Deep layers discard vision tokens, focusing solely on linguistic refinement. Based on these findings, we propose VisiPruner, a training-free pruning framework that reduces up to 99% of vision-related attention computations and 53.9% of FLOPs on LLaVA-v1.5-7B. It significantly outperforms existing token pruning methods and generalizes across diverse MLLMs. Beyond pruning, our insights further provide actionable guidelines for training efficient MLLMs by aligning model architecture with its intrinsic layer-wise processing dynamics. Our code is available at: https://github.com/EIT-NLP/VisiPruner.
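To make the quadratic-cost argument concrete, the following sketch counts query-key pairs under causal attention and estimates how much of the vision-related share disappears when visual tokens are removed in later layers. It is an illustration only, not the VisiPruner algorithm: the function names, the 64-token text prompt, the simplified token ordering (all visual tokens before all text tokens), and the per-layer schedule are assumptions made for this example. The 576 visual tokens and 32 decoder layers correspond to the LLaVA-v1.5-7B configuration.

```python
def attention_pairs(num_visual: int, num_text: int) -> tuple[int, int]:
    """Return (total, vision_related) query-key pairs under causal masking.

    Simplifying assumption: the sequence is [visual tokens, text tokens],
    so every text query can attend to every visual key.
    """
    n = num_visual + num_text
    total = n * (n + 1) // 2                      # query i attends to keys 0..i
    text_only = num_text * (num_text + 1) // 2    # pairs with no visual token
    return total, total - text_only


def vision_pair_reduction(kept_per_layer: list[int],
                          full_visual: int,
                          num_text: int) -> float:
    """Fraction of vision-related pairs removed by a per-layer schedule.

    kept_per_layer[i] is the number of visual tokens still present in
    layer i; the baseline keeps full_visual tokens in every layer.
    """
    baseline = len(kept_per_layer) * attention_pairs(full_visual, num_text)[1]
    pruned = sum(attention_pairs(k, num_text)[1] for k in kept_per_layer)
    return 1.0 - pruned / baseline


# Hypothetical three-stage schedule (illustrative numbers, not from the paper):
# all visual tokens in 2 shallow layers, 32 in 10 middle layers, none after.
schedule = [576] * 2 + [32] * 10 + [0] * 20
print(f"vision-related pairs removed: "
      f"{vision_pair_reduction(schedule, 576, 64):.1%}")
```

Because almost all pairs in a 576 + 64 token sequence involve at least one visual token, a schedule that drops visual tokens early removes most of the attention work, which is consistent in spirit with the large reductions reported above. The count covers attention score pairs only, not total FLOPs, which also include the feed-forward layers.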