MI-Pruner: Crossmodal Mutual Information-guided Token Pruner for Efficient MLLMs

📅 2026-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing visual token pruning methods rely on attention scores, which struggle to accurately capture cross-modal relevance, leading to underutilization of visual information in multimodal large language models. This work proposes a non-intrusive pruning strategy based on mutual information between visual and textual features, directly quantifying cross-modal dependencies prior to modality interaction, without requiring access to internal attention maps or architectural modifications. By introducing mutual information as a token-importance criterion in a cross-modal context for the first time, the method substantially outperforms existing attention-based pruning approaches, improving model efficiency while adding minimal inference latency.
📝 Abstract
For multimodal large language models (MLLMs), visual information is relatively sparse compared with text. As a result, research on visual token pruning has emerged to enable efficient inference. Current approaches typically measure token importance based on attention scores in the visual encoder or the LLM decoder, then select visual tokens with high attention scores while pruning the others. In this paper, we pursue a different and more surgical approach. Instead of relying on mechanism-specific signals, we directly compute Mutual Information (MI) between visual and textual features themselves, prior to their interaction. This allows us to explicitly measure crossmodal dependency at the feature level. Our MI-Pruner is simple, efficient, and non-intrusive, requiring no access to internal attention maps or architectural modifications. Experimental results demonstrate that our approach outperforms previous attention-based pruning methods with minimal latency.
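The abstract does not specify which MI estimator MI-Pruner uses, so the following is only a minimal sketch of the general idea: score each visual token by an MI estimate against a pooled text feature before any cross-attention, then keep the top-scoring tokens. Here the score uses the closed-form MI of a bivariate Gaussian, -0.5·log(1 - ρ²), treating feature dimensions as samples; all function names and the Gaussian assumption are illustrative, not the paper's actual estimator.

```python
import numpy as np

def gaussian_mi(v, t):
    """Closed-form MI of a bivariate Gaussian, -0.5 * log(1 - rho^2),
    where rho is the Pearson correlation between the two feature vectors.
    (Illustrative proxy; the paper's estimator is not specified here.)"""
    rho = np.corrcoef(v, t)[0, 1]
    rho = np.clip(rho, -0.999999, 0.999999)  # avoid log(0)
    return -0.5 * np.log(1.0 - rho ** 2)

def mi_prune(visual_tokens, text_feat, keep_ratio=0.5):
    """Keep the keep_ratio fraction of visual tokens with the highest
    MI score against the pooled text feature; prune the rest.
    visual_tokens: (N, D) array of visual token features.
    text_feat: (D,) pooled text feature."""
    scores = np.array([gaussian_mi(v, text_feat) for v in visual_tokens])
    k = max(1, int(len(visual_tokens) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-k:])  # preserve token order
    return visual_tokens[keep], keep

# Usage: prune 16 random "visual tokens" down to 8 against a text feature.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 64))
text = rng.standard_normal(64)
kept, idx = mi_prune(tokens, text, keep_ratio=0.5)
```

Because the scoring happens on the features themselves, before any attention layer, this kind of pruner can sit in front of an off-the-shelf MLLM without touching its internals, which is the non-intrusiveness the abstract emphasizes.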
Problem

Research questions and friction points this paper is trying to address.

multimodal large language models
visual token pruning
crossmodal mutual information
efficient inference
attention-based pruning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mutual Information
Token Pruning
Crossmodal Dependency
Efficient MLLMs
Non-intrusive Pruning