VisiPruner: Decoding Discontinuous Cross-Modal Dynamics for Efficient Multimodal LLMs

📅 2025-10-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) suffer from prohibitively high inference overhead due to quadratic growth in cross-modal attention computation. Method: This work is the first to identify a three-stage, layer-wise discontinuous cross-modal interaction pattern in MLLMs; based on this insight, we propose VisiPruner—a training-free, dynamic token pruning framework that performs hierarchical, adaptive visual token removal by real-time analysis of vision–language information flow and visual token importance. Contribution/Results: VisiPruner is architecture-agnostic and achieves 99% reduction in vision-related attention computation and 53.9% reduction in total FLOPs on LLaVA-v1.5-7B, substantially outperforming existing pruning methods. It demonstrates strong generalization across diverse MLLM backbones and tasks, establishing a novel paradigm for efficient multimodal inference.

📝 Abstract
Multimodal Large Language Models (MLLMs) have achieved strong performance across vision-language tasks, but suffer from significant computational overhead due to the quadratic growth of attention computations with the number of multimodal tokens. Though efforts have been made to prune tokens in MLLMs, they lack a fundamental understanding of how MLLMs process and fuse multimodal information. Through systematic analysis, we uncover a three-stage cross-modal interaction process: (1) Shallow layers recognize task intent, with visual tokens acting as passive attention sinks; (2) Cross-modal fusion occurs abruptly in middle layers, driven by a few critical visual tokens; (3) Deep layers discard vision tokens, focusing solely on linguistic refinement. Based on these findings, we propose VisiPruner, a training-free pruning framework that reduces up to 99% of vision-related attention computations and 53.9% of FLOPs on LLaVA-v1.5-7B. It significantly outperforms existing token pruning methods and generalizes across diverse MLLMs. Beyond pruning, our insights further provide actionable guidelines for training efficient MLLMs by aligning model architecture with its intrinsic layer-wise processing dynamics. Our code is available at: https://github.com/EIT-NLP/VisiPruner.
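The core mechanism the abstract describes is scoring visual tokens by the attention they receive from text tokens and keeping only the critical few. A minimal sketch of that idea is shown below; the function names, the head-averaged scoring, and the top-k selection are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def visual_token_importance(attn, vision_idx, text_idx):
    """Score each visual token by the attention it receives from text tokens.

    attn: (num_heads, seq_len, seq_len) attention matrix from one layer.
    vision_idx / text_idx: positions of visual and text tokens in the sequence.
    Hypothetical scoring rule, for illustration only.
    """
    mean_attn = attn.mean(axis=0)                   # average over heads: (seq, seq)
    flow = mean_attn[np.ix_(text_idx, vision_idx)]  # text -> vision attention block
    return flow.sum(axis=0)                         # one importance score per visual token

def prune_visual_tokens(hidden, attn, vision_idx, text_idx, keep_ratio=0.1):
    """Keep only the top-k most-attended visual tokens; text tokens are untouched."""
    scores = visual_token_importance(attn, vision_idx, text_idx)
    k = max(1, int(len(vision_idx) * keep_ratio))
    keep_local = np.argsort(scores)[-k:]            # indices into vision_idx
    drop = np.setdiff1d(vision_idx, np.asarray(vision_idx)[keep_local])
    mask = np.ones(hidden.shape[0], dtype=bool)
    mask[drop] = False
    return hidden[mask], mask
```

In a real MLLM this pruning would run inside the forward pass, once the layer-wise analysis has located the fusion layers where the critical tokens matter.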
Problem

Research questions and friction points this paper is trying to address.

Reducing computational overhead from quadratic attention growth in MLLMs
Understanding how MLLMs process and fuse multimodal information across layers
Developing efficient pruning methods aligned with cross-modal interaction dynamics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prunes vision tokens using cross-modal dynamics analysis
Reduces attention computations by identifying critical fusion layers
Provides training-free framework for efficient multimodal models
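The three-stage pattern behind these contributions maps naturally onto a per-layer retention schedule: keep everything in shallow layers, keep only a critical subset through the fusion layers, and drop vision tokens in deep layers. The sketch below illustrates that schedule; the layer boundaries and ratios are made-up placeholders, not values from the paper.

```python
def vision_keep_ratio(layer: int, fusion_start: int = 12, fusion_end: int = 16) -> float:
    """Illustrative three-stage schedule for retaining visual tokens.

    Boundaries are hypothetical:
    - stage 1 (shallow): visual tokens act as attention sinks, keep all;
    - stage 2 (fusion):  a few critical tokens drive cross-modal fusion;
    - stage 3 (deep):    linguistic refinement only, drop vision entirely.
    """
    if layer < fusion_start:
        return 1.0   # stage 1: all visual tokens retained
    if layer <= fusion_end:
        return 0.1   # stage 2: only the critical subset
    return 0.0       # stage 3: vision tokens discarded
```

Such a schedule is what makes the approach training-free: it only gates which tokens each layer attends to, without touching model weights.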
Yingqi Fan
Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative, Institute of Digital Twin, EIT, Ningbo

Anhao Zhao
Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative, Institute of Digital Twin, EIT, Ningbo; Hong Kong Polytechnic University

Jinlan Fu
National University of Singapore
Natural Language Processing · Vision and Language · Large Language Model

Junlong Tong
Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative, Institute of Digital Twin, EIT, Ningbo; Shanghai Jiao Tong University

Hui Su
Meituan Inc.

Yijie Pan
Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative, Institute of Digital Twin, EIT, Ningbo

Wei Zhang
Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative, Institute of Digital Twin, EIT, Ningbo

Xiaoyu Shen
Eastern Institute of Technology, Ningbo
language model · multi-modal learning · reasoning