CROP: Contextual Region-Oriented Visual Token Pruning

📅 2025-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language model (VLM)-based visual question answering (VQA) methods encode the entire image, generating numerous vision tokens irrelevant to the question—leading to prohibitive computational and memory overhead. To address this, we propose a two-stage context-aware region-guided vision token pruning framework. In the first stage, a lightweight localization model identifies question-relevant image regions. In the second stage, we jointly apply training-free Inner-LLM Pruning (ILP) and adaptive Pre-LLM Compression (PLC) to prune and compress vision tokens early in the LLM’s transformer layers, guided by the localized regions. Our method requires no fine-tuning and is architecture-agnostic, compatible with mainstream VLMs. Evaluated across multiple VQA benchmarks, it achieves state-of-the-art accuracy while significantly reducing GPU memory consumption (up to 42%) and inference latency (up to 38%), effectively balancing efficiency and performance.

📝 Abstract
Current VLM-based VQA methods often process entire images, leading to excessive visual tokens that include redundant information irrelevant to the posed question. This abundance of unnecessary image details creates numerous visual tokens, drastically increasing memory and computational requirements in VLMs. To address this, we propose Contextual Region-Oriented Visual Token Pruning (CROP), a novel framework to compress visual tokens through a two-step process: Localization and Pruning. Specifically, CROP first employs an efficient model to identify the contextual region relevant to the input query. Subsequently, two distinct strategies are introduced for pruning: (1) Pre-LLM Compression (PLC), which adaptively compresses different image regions with varying ratios, and (2) Inner-LLM Pruning (ILP), a training-free method that prunes tokens within early LLM layers guided by the identified contextual region. Extensive experiments on a wide range of VQA tasks demonstrate that CROP significantly outperforms existing visual token pruning methods and achieves state-of-the-art performance. Our code and datasets will be made available.
Problem

Research questions and friction points this paper is trying to address.

Reduces excessive visual tokens in VLM-based VQA methods
Compresses visual tokens via localization and pruning strategies
Improves memory and computational efficiency in VLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contextual region identification for token pruning
Pre-LLM adaptive compression of image regions
Training-free Inner-LLM token pruning guided by context
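To make the region-guided pruning idea concrete, here is a minimal sketch (not the authors' code) of the Pre-LLM Compression intuition: tokens whose patches fall inside the question-relevant region are kept in full, while out-of-region tokens are subsampled at a higher compression ratio. The function names (`region_mask`, `prune_vision_tokens`) and the parameter `keep_ratio_out` are hypothetical illustrations.

```python
def region_mask(grid_h, grid_w, region):
    """Boolean mask over a grid_h x grid_w patch grid.

    region = (top, left, bottom, right), with exclusive bottom/right bounds.
    """
    top, left, bottom, right = region
    mask = []
    for r in range(grid_h):
        for c in range(grid_w):
            mask.append(top <= r < bottom and left <= c < right)
    return mask

def prune_vision_tokens(tokens, grid_h, grid_w, region, keep_ratio_out=0.25):
    """Keep all in-region tokens; uniformly subsample out-of-region tokens."""
    mask = region_mask(grid_h, grid_w, region)
    inside = [t for t, m in zip(tokens, mask) if m]
    outside = [t for t, m in zip(tokens, mask) if not m]
    stride = max(1, round(1 / keep_ratio_out))  # e.g. keep 1 of every 4
    return inside + outside[::stride]

tokens = list(range(64))      # an 8x8 patch grid, token ids 0..63
region = (2, 2, 6, 6)         # a 4x4 question-relevant region
pruned = prune_vision_tokens(tokens, 8, 8, region)
print(len(tokens), "->", len(pruned))  # 64 -> 28 (16 in-region + 12 of 48 outside)
```

The same region mask could, in principle, drive the Inner-LLM Pruning step by dropping out-of-region token positions after the early transformer layers; the paper's actual selection criteria and compression ratios are specified in the full text.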
Jiawei Guo
BUPT & M-A-P
LLM, MLLM
Feifei Zhai
Institute of Automation, Chinese Academy of Sciences
Machine Translation, Natural Language Processing, Machine Learning
Pu Jian
Institute of Automation, Chinese Academy of Sciences
Multimodal, Machine Learning, NLP
Qianrun Wei
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, CAS, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Yu Zhou
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, CAS, Beijing, China; Fanyu AI Laboratory, Zhongke Fanyu Technology Co., Ltd, Beijing, China