EntropyPrune: Matrix Entropy Guided Visual Token Pruning for Multimodal Large Language Models

📅 2026-02-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost of multimodal large language models arising from processing excessive visual tokens. Existing pruning methods rely on heuristic, static layer selection, lacking interpretability and cross-model generalization. To overcome these limitations, the authors propose a dynamic visual token pruning framework grounded in matrix entropy. Their approach introduces, for the first time, the concept of an “entropy collapse layer”—identified via matrix entropy—as the point where information content drops sharply. Leveraging the spectral equivalence of dual Gram matrices, the method efficiently quantifies token informativeness without relying on attention maps. Evaluated on LLaVA-1.5-7B, it reduces FLOPs by 68.2% while retaining 96.0% of the original performance, significantly outperforming existing techniques. Moreover, the framework demonstrates strong generalization across high-resolution and video-based multimodal models.

📝 Abstract
Multimodal large language models (MLLMs) incur substantial inference cost due to the processing of hundreds of visual tokens per image. Although token pruning has proven effective for accelerating inference, determining when and where to prune remains largely heuristic. Existing approaches typically rely on static, empirically selected layers, which limits interpretability and transferability across models. In this work, we introduce a matrix-entropy perspective and identify an "Entropy Collapse Layer" (ECL), where the information content of visual representations exhibits a sharp and consistent drop, providing a principled criterion for selecting the pruning stage. Building on this observation, we propose EntropyPrune, a novel matrix-entropy-guided token pruning framework that quantifies the information value of individual visual tokens and prunes redundant ones without relying on attention maps. Moreover, to enable efficient computation, we exploit the spectral equivalence of dual Gram matrices, reducing the complexity of entropy computation and yielding up to a 64x theoretical speedup. Extensive experiments on diverse multimodal benchmarks demonstrate that EntropyPrune consistently outperforms state-of-the-art pruning methods in both accuracy and efficiency. On LLaVA-1.5-7B, our method achieves a 68.2% reduction in FLOPs while preserving 96.0% of the original performance. Furthermore, EntropyPrune generalizes effectively to high-resolution and video-based models, highlighting its strong robustness and scalability in practical MLLM acceleration. The code will be publicly available at https://github.com/YahongWang1/EntropyPrune.
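The dual-Gram trick from the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a von Neumann-style matrix entropy (entropy of the trace-normalized Gram spectrum), and the `matrix_entropy` helper and toy 576-token/64-dim features are hypothetical. The key fact it demonstrates is that X·Xᵀ (n×n) and Xᵀ·X (d×d) share the same nonzero eigenvalues, so the entropy can be computed on whichever Gram matrix is smaller:

```python
import numpy as np

def matrix_entropy(X: np.ndarray) -> float:
    """Von Neumann-style matrix entropy of token features X (n_tokens x dim).

    Uses the spectral equivalence of the dual Gram matrices X @ X.T and
    X.T @ X: both share the same nonzero eigenvalues, so we diagonalize
    whichever is smaller (assumed formulation, for illustration only).
    """
    n, d = X.shape
    G = X @ X.T if n <= d else X.T @ X   # pick the cheaper Gram matrix
    eig = np.clip(np.linalg.eigvalsh(G), 0.0, None)  # PSD: clip numerical noise
    p = eig / eig.sum()                  # trace-normalize to a distribution
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Sanity check: both dual Gram matrices yield the same entropy.
rng = np.random.default_rng(0)
X = rng.standard_normal((576, 64))       # e.g. 576 visual tokens, 64-dim features
h_small = matrix_entropy(X)              # diagonalizes the 64x64 Gram matrix
eig_big = np.clip(np.linalg.eigvalsh(X @ X.T), 0.0, None)  # 576x576 Gram matrix
p_big = eig_big / eig_big.sum()
p_big = p_big[p_big > 0]
h_big = float(-(p_big * np.log(p_big)).sum())
assert abs(h_small - h_big) < 1e-8
```

Because eigendecomposition costs O(m³) for an m×m matrix, diagonalizing the d×d dual instead of the n×n Gram matrix gives a (n/d)³-style saving, which is the source of the large theoretical speedup the abstract reports.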
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
Visual Token Pruning
Inference Cost
Token Redundancy
Model Acceleration
Innovation

Methods, ideas, or system contributions that make the work stand out.

matrix entropy
token pruning
multimodal large language models
entropy collapse layer
spectral equivalence
Yahong Wang
School of Computer Science and Technology, Tongji University
Juncheng Wu
University of California, Santa Cruz
Foundation Models, Reasoning LLMs
Zhangkai Ni
School of Computer Science and Technology, Tongji University
Chengmei Yang
School of Computer Science and Technology, Tongji University
Yihang Liu
School of Computer Science and Technology, Tongji University
Longzhen Yang
School of Computer Science and Technology, Tongji University
Yuyin Zhou
Assistant Professor, Computer Science and Engineering, Genomics Institute, UC Santa Cruz
medical image analysis, machine learning, computer vision, AI in healthcare
Ying Wen
Associate Professor, Shanghai Jiao Tong University
Multi-Agent Learning, Reinforcement Learning
Lianghua He
Shanghai Eye Disease Prevention and Treatment Center