HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models

📅 2026-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high inference latency and substantial computational overhead of multimodal large language models (MLLMs) caused by excessive visual tokens, which hinder deployment in real-time or resource-constrained settings. The authors propose a training-free, dynamic visual token pruning method that introduces, for the first time, an attention-head importance-aware mechanism. By leveraging text-guided attention to evaluate token saliency, the approach overcomes the common assumption that all attention heads contribute equally. The method is compatible with various MLLMs and achieves state-of-the-art performance across multiple vision-language benchmarks. For instance, in Qwen2.5-VL, it prunes 80.2% of visual tokens while retaining 96.0% of the original accuracy, reduces end-to-end latency to 74.4% of the baseline, and significantly lowers GPU memory consumption.
📝 Abstract
In multimodal large language models (MLLMs), the surge of visual tokens significantly increases the inference time and computational overhead, making them impractical for real-time or resource-constrained applications. Visual token pruning is a promising strategy for reducing the cost of MLLM inference by removing redundant visual tokens. Existing research usually assumes that all attention heads contribute equally to the visual interpretation. However, our study reveals that different heads may capture distinct visual semantics and inherently play distinct roles in visual processing. In light of this observation, we propose HAWK, a head importance-aware visual token pruning method that perceives the varying importance of attention heads in visual tasks to maximize the retention of crucial tokens. By leveraging head importance weights and text-guided attention to assess visual token significance, HAWK effectively retains task-relevant visual tokens while removing redundant ones. The proposed HAWK is entirely training-free and can be seamlessly applied to various MLLMs. Extensive experiments on multiple mainstream vision-language benchmarks demonstrate that HAWK achieves state-of-the-art accuracy. When applied to Qwen2.5-VL, HAWK retains 96.0% of the original accuracy after pruning 80.2% of the visual tokens. Additionally, it reduces end-to-end latency to 74.4% of the original and further decreases GPU memory usage across the tested models. The code is available at https://github.com/peppery77/HAWK.git.
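The scoring pipeline the abstract describes (weight each head's text-to-visual attention by its importance, aggregate into a per-token saliency score, keep the top-ranked tokens) can be sketched roughly as follows. This is a minimal illustration, not the paper's exact formulation: the function name `hawk_prune_sketch`, the mean-over-text-positions aggregation, and the assumption that per-head importance weights are given as input are all illustrative choices.

```python
import numpy as np

def hawk_prune_sketch(attn, head_weights, keep_ratio=0.2):
    """Illustrative head importance-aware visual token selection.

    attn:         (H, T_text, T_vis) text-to-visual attention from one layer
    head_weights: (H,) importance weight per attention head (assumed given;
                  how HAWK estimates these is not specified in the abstract)
    keep_ratio:   fraction of visual tokens to retain
    Returns the indices of retained visual tokens, in original order.
    """
    # Scale each head's attention map by that head's importance weight,
    # so heads deemed more relevant to the visual task count for more.
    weighted = attn * head_weights[:, None, None]          # (H, T_text, T_vis)
    # Aggregate over heads and text positions into one saliency per token.
    scores = weighted.sum(axis=0).mean(axis=0)             # (T_vis,)
    # Retain the top-k tokens; sorting preserves their spatial order.
    k = max(1, int(keep_ratio * scores.size))
    keep_idx = np.sort(np.argsort(scores)[-k:])
    return keep_idx

# Toy usage: 4 heads, 3 text tokens, 10 visual tokens, keep 20%.
rng = np.random.default_rng(0)
attn = rng.random((4, 3, 10))
kept = hawk_prune_sketch(attn, head_weights=np.ones(4), keep_ratio=0.2)
```

With uniform head weights this degenerates to plain text-guided attention scoring; the head-importance term is what distinguishes the approach from equal-weight baselines.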
Problem

Research questions and friction points this paper is trying to address.

visual token pruning
multimodal large language models
attention heads
computational overhead
inference latency
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual token pruning
head importance awareness
multimodal large language models
training-free acceleration
text-guided attention
Qihui Zhu
University of Science and Technology of China
Tao Zhang
University of Science and Technology of China
Yuchen Wang
University of Science and Technology of China
Zijian Wen
University of Science and Technology of China
Mengjie Zhang
University of Science and Technology of China
Shuangwu Chen
University of Science and Technology of China
Xiaobin Tan
University of Science and Technology of China
Jian Yang
University of Science and Technology of China
Yang Liu
ChangXin Memory Technologies, Inc
Zhenhua Dong
Noah's Ark Lab, Huawei Technologies Co., Ltd.
Xianzhi Yu
Unknown affiliation
Yinfei Pan
Huawei Noah’s Ark Lab