🤖 AI Summary
Vision-language models suffer from low inference efficiency due to the excessive number of visual tokens output by the vision encoder. This paper proposes a training-free hierarchical visual token pruning method: it first leverages mid-layer attention maps from the vision encoder to compute fine-grained importance scores, then reshapes the 1-D token sequence into a 2-D spatial layout and retains important tokens in two stages, globally (region-level) and locally (window-level). The approach preserves both semantic integrity and spatial structure, substantially accelerating inference without accuracy loss. On LLaVA-v1.5-7B and LLaVA-Next-7B, time-to-first-token is reduced by up to 50.0% and 55.1%, respectively, while generation throughput increases by up to 60.9% and 47.3%; accuracy even improves on several benchmarks. The core contributions are (i) a mid-layer attention-driven importance scoring mechanism and (ii) a lightweight, training-free, and accuracy-preserving hierarchical pruning paradigm.
📝 Abstract
Vision-Language Models (VLMs) have shown strong capabilities on diverse multimodal tasks. However, the large number of visual tokens output by the vision encoder severely hinders inference efficiency, and prior studies have shown that many of these tokens are unimportant and can be safely pruned. In this work, we propose HIVTP, a training-free method to improve VLM inference efficiency via hierarchical visual token pruning guided by a novel middle-layer-based importance score. Specifically, we utilize attention maps extracted from the middle layers of the vision encoder, which better reflect fine-grained and object-level attention, to estimate visual token importance. Based on these scores, we propose a hierarchical visual token pruning method that retains both globally and locally important visual tokens: we reshape the 1-D visual token sequence output by the vision encoder into a 2-D spatial layout; in the global retaining stage, we divide the image into regions and retain the tokens with higher importance scores in each region; in the local retaining stage, we further divide the image into small windows and retain the most important token in each window. Experimental results show that HIVTP can reduce the time-to-first-token (TTFT) of LLaVA-v1.5-7B and LLaVA-Next-7B by up to 50.0% and 55.1%, respectively, and improve token generation throughput by up to 60.9% and 47.3%, without sacrificing accuracy, even achieving improvements on certain benchmarks. Compared with prior works, HIVTP achieves better accuracy while offering higher inference efficiency.
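The two-stage retain procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the grid size, region size, window size, and per-region keep count below are hypothetical placeholders (the paper does not specify them here), and the importance scores are assumed to be precomputed from the vision encoder's mid-layer attention maps.

```python
import numpy as np

def hierarchical_prune(scores_1d, grid=24, region=6, window=2, keep_per_region=4):
    """Sketch of two-stage hierarchical visual token pruning.

    scores_1d: per-token importance scores (assumed precomputed from
    mid-layer attention), length grid*grid. All sizes are illustrative
    assumptions, not the paper's hyperparameters.
    Returns sorted indices (into the 1-D token sequence) of retained tokens.
    """
    # Reshape the 1-D token sequence back into its 2-D spatial layout.
    scores = np.asarray(scores_1d, dtype=np.float64).reshape(grid, grid)
    keep = set()

    # Global retaining stage: divide the image into regions and keep the
    # tokens with the highest importance scores within each region.
    for ri in range(0, grid, region):
        for ci in range(0, grid, region):
            block = scores[ri:ri + region, ci:ci + region].ravel()
            for t in np.argsort(block)[-keep_per_region:]:
                r, c = divmod(int(t), region)
                keep.add((ri + r) * grid + (ci + c))

    # Local retaining stage: divide the image into small windows and keep
    # only the single most important token in each window.
    for ri in range(0, grid, window):
        for ci in range(0, grid, window):
            t = int(np.argmax(scores[ri:ri + window, ci:ci + window]))
            r, c = divmod(t, window)
            keep.add((ri + r) * grid + (ci + c))

    return sorted(keep)
```

Because the global stage preserves the strongest tokens per region while the local stage guarantees at least one token survives in every small window, the retained set keeps both semantically salient content and overall spatial coverage; the pruned sequence (indexed by the returned positions) is then passed to the language model in place of the full token sequence.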