VisiPruner: Decoding Discontinuous Cross-Modal Dynamics for Efficient Multimodal LLMs

📅 2025-10-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) suffer from prohibitively high inference overhead due to quadratic growth in cross-modal attention computation. Method: This work is the first to identify a three-stage, layer-wise discontinuous cross-modal interaction pattern in MLLMs; based on this insight, we propose VisiPruner—a training-free, dynamic token pruning framework that performs hierarchical, adaptive visual token removal by real-time analysis of vision–language information flow and visual token importance. Contribution/Results: VisiPruner is architecture-agnostic and achieves 99% reduction in vision-related attention computation and 53.9% reduction in total FLOPs on LLaVA-v1.5-7B, substantially outperforming existing pruning methods. It demonstrates strong generalization across diverse MLLM backbones and tasks, establishing a novel paradigm for efficient multimodal inference.

📝 Abstract
Multimodal Large Language Models (MLLMs) have achieved strong performance across vision-language tasks, but suffer from significant computational overhead due to the quadratic growth of attention computations with the number of multimodal tokens. Though efforts have been made to prune tokens in MLLMs, they lack a fundamental understanding of how MLLMs process and fuse multimodal information. Through systematic analysis, we uncover a three-stage cross-modal interaction process: (1) Shallow layers recognize task intent, with visual tokens acting as passive attention sinks; (2) Cross-modal fusion occurs abruptly in middle layers, driven by a few critical visual tokens; (3) Deep layers discard vision tokens, focusing solely on linguistic refinement. Based on these findings, we propose VisiPruner, a training-free pruning framework that reduces up to 99% of vision-related attention computations and 53.9% of FLOPs on LLaVA-v1.5-7B. It significantly outperforms existing token pruning methods and generalizes across diverse MLLMs. Beyond pruning, our insights further provide actionable guidelines for training efficient MLLMs by aligning model architecture with its intrinsic layer-wise processing dynamics. Our code is available at: https://github.com/EIT-NLP/VisiPruner.
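The core mechanism the abstract describes is scoring visual tokens by the attention they receive from text tokens and keeping only the critical few. A minimal sketch of that idea is shown below; the function names, the head-averaged scoring, and the top-k selection are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def visual_token_importance(attn, vision_idx, text_idx):
    """Score each visual token by the attention it receives from text tokens.

    attn: (num_heads, seq_len, seq_len) attention matrix from one layer.
    vision_idx / text_idx: positions of visual and text tokens in the sequence.
    Hypothetical scoring rule, for illustration only.
    """
    mean_attn = attn.mean(axis=0)                   # average over heads: (seq, seq)
    flow = mean_attn[np.ix_(text_idx, vision_idx)]  # text -> vision attention block
    return flow.sum(axis=0)                         # one importance score per visual token

def prune_visual_tokens(hidden, attn, vision_idx, text_idx, keep_ratio=0.1):
    """Keep only the top-k most-attended visual tokens; text tokens are untouched."""
    scores = visual_token_importance(attn, vision_idx, text_idx)
    k = max(1, int(len(vision_idx) * keep_ratio))
    keep_local = np.argsort(scores)[-k:]            # indices into vision_idx
    drop = np.setdiff1d(vision_idx, np.asarray(vision_idx)[keep_local])
    mask = np.ones(hidden.shape[0], dtype=bool)
    mask[drop] = False
    return hidden[mask], mask
```

In a real MLLM this pruning would run inside the forward pass, once the layer-wise analysis has located the fusion layers where the critical tokens matter.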
Problem

Research questions and friction points this paper is trying to address.

Reducing computational overhead from quadratic attention growth in MLLMs
Understanding how MLLMs process and fuse multimodal information across layers
Developing efficient pruning methods aligned with cross-modal interaction dynamics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prunes vision tokens using cross-modal dynamics analysis
Reduces attention computations by identifying critical fusion layers
Provides training-free framework for efficient multimodal models
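The three-stage pattern behind these contributions maps naturally onto a per-layer retention schedule: keep everything in shallow layers, keep only a critical subset through the fusion layers, and drop vision tokens in deep layers. The sketch below illustrates that schedule; the layer boundaries and ratios are made-up placeholders, not values from the paper.

```python
def vision_keep_ratio(layer: int, fusion_start: int = 12, fusion_end: int = 16) -> float:
    """Illustrative three-stage schedule for retaining visual tokens.

    Boundaries are hypothetical:
    - stage 1 (shallow): visual tokens act as attention sinks, keep all;
    - stage 2 (fusion):  a few critical tokens drive cross-modal fusion;
    - stage 3 (deep):    linguistic refinement only, drop vision entirely.
    """
    if layer < fusion_start:
        return 1.0   # stage 1: all visual tokens retained
    if layer <= fusion_end:
        return 0.1   # stage 2: only the critical subset
    return 0.0       # stage 3: vision tokens discarded
```

Such a schedule is what makes the approach training-free: it only gates which tokens each layer attends to, without touching model weights.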
Yingqi Fan
Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative, Institute of Digital Twin, EIT, Ningbo

Anhao Zhao
Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative, Institute of Digital Twin, EIT, Ningbo; Hong Kong Polytechnic University

Jinlan Fu
National University of Singapore
Natural Language Processing · Vision and Language · Large Language Model

Junlong Tong
Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative, Institute of Digital Twin, EIT, Ningbo; Shanghai Jiao Tong University

Hui Su
Meituan Inc.

Yijie Pan
Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative, Institute of Digital Twin, EIT, Ningbo

Wei Zhang
Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative, Institute of Digital Twin, EIT, Ningbo

Xiaoyu Shen
Eastern Institute of Technology, Ningbo
language model · multi-modal learning · reasoning