Can Visual Input Be Compressed? A Visual Token Compression Benchmark for Large Multimodal Models

📅 2025-11-04
🤖 AI Summary
Large multimodal models (LMMs) suffer from severe inference inefficiency due to the excessive number of visual tokens generated by image encoders, and existing token pruning/merging methods lack standardized evaluation. To address this, we propose UniPruneBench, the first unified benchmark for visual token pruning in LMMs, covering six capability dimensions, ten diverse datasets, and three mainstream LMM architectures, and integrating ten representative compression algorithms. It introduces multi-dimensional metrics including accuracy, end-to-end latency, and prefill latency. This standardized evaluation framework uncovers critical empirical insights: random pruning is a surprisingly strong baseline; sensitivity to pruning varies significantly across capabilities, with OCR tasks being the most vulnerable; no single method consistently outperforms the others across all settings; and pruning ratio, rather than algorithm choice, is the dominant factor in performance degradation. UniPruneBench advances visual token compression research toward standardization, reproducibility, and scalability.

📝 Abstract
Large multimodal models (LMMs) often suffer from severe inference inefficiency due to the large number of visual tokens introduced by image encoders. While recent token compression methods, such as pruning and merging, have shown promise in reducing redundancy, their evaluation remains fragmented and inconsistent. In this work, we present UniPruneBench, a unified and extensible benchmark for visual token pruning in multimodal LLMs. UniPruneBench provides standardized protocols across six ability dimensions and ten datasets, covering ten representative compression algorithms and three families of LMMs (LLaVA-v1.5, Intern-VL3, and Qwen2.5-VL). Beyond task accuracy, it incorporates system-level metrics such as runtime and prefilling latency to provide a holistic view. Our experiments uncover several key findings: (1) random pruning is a surprisingly strong baseline, (2) no single method consistently outperforms others across scenarios, (3) pruning sensitivity varies significantly across tasks, with OCR being most vulnerable, and (4) pruning ratio is the dominant factor governing performance degradation. We believe UniPruneBench will serve as a reliable foundation for future research on efficient multimodal modeling.
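The "random pruning is a surprisingly strong baseline" finding refers to simply keeping a uniformly random subset of the visual tokens before they enter the language model. A minimal illustrative sketch (not UniPruneBench's actual code; the token count of 576 matches LLaVA-v1.5's visual sequence length, and the `keep_ratio` value is an assumption):

```python
import random

def random_prune(visual_tokens, keep_ratio, seed=0):
    """Randomly keep a fraction of the visual tokens, preserving
    their original order. This is the naive baseline that the
    benchmark reports as surprisingly competitive."""
    rng = random.Random(seed)
    n_keep = max(1, int(len(visual_tokens) * keep_ratio))
    keep_idx = sorted(rng.sample(range(len(visual_tokens)), n_keep))
    return [visual_tokens[i] for i in keep_idx]

# e.g. LLaVA-v1.5 produces 576 visual tokens per image
tokens = [f"tok{i}" for i in range(576)]
pruned = random_prune(tokens, keep_ratio=0.25)
print(len(pruned))  # 144
```

Because the surviving indices are sorted, the pruned sequence keeps the spatial ordering of the original tokens, which matters for positional encodings downstream.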
Problem

Research questions and friction points this paper is trying to address.

Addressing visual token redundancy in large multimodal models
Evaluating compression methods with unified benchmark protocols
Analyzing performance trade-offs across tasks and systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified benchmark for visual token pruning
Standardized evaluation across multiple ability dimensions
System-level metrics including runtime and latency
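System-level metrics such as prefill latency can be captured with simple warm-up-then-average wall-clock timing. The harness below is a generic sketch of that measurement pattern, not UniPruneBench's implementation, and the workload passed in at the end is a hypothetical stand-in for a model's prefill pass:

```python
import time

def measure_latency(fn, n_warmup=2, n_runs=5):
    """Return mean wall-clock seconds of fn() over n_runs,
    after n_warmup untimed calls to avoid cold-start effects."""
    for _ in range(n_warmup):
        fn()
    total = 0.0
    for _ in range(n_runs):
        t0 = time.perf_counter()
        fn()
        total += time.perf_counter() - t0
    return total / n_runs

# Hypothetical stand-in workload: prefill cost grows with the number
# of tokens, which is why pruning visual tokens reduces this metric.
latency = measure_latency(lambda: sum(i * i for i in range(100_000)))
print(latency > 0)  # True
```

Averaging over several runs after warm-up is the usual way to smooth out scheduler and cache noise in latency benchmarks.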
👥 Authors
Tianfan Peng (Shandong University)
Yuntao Du (Purdue University)
Pengzhou Ji (Tongji University)
Shijie Dong (Shandong University)
Kailin Jiang (University of Science and Technology of China)
Mingchuan Ma (Sichuan University)
Yijun Tian (Amazon AWS AI Lab)
Jinhe Bi (LMU Munich)
Qian Li (Shandong University)
Wei Du (Shandong University)
Feng Xiao (EB Tech Co., Ltd.)
Lizhen Cui (Shandong University)