CATP: Contextually Adaptive Token Pruning for Efficient and Enhanced Multimodal In-Context Learning

📅 2025-08-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
In multimodal in-context learning (ICL), excessive image tokens lead to inefficient inference and unstable performance. To address this, we propose a training-free, progressive adaptive pruning method tailored to multimodal ICL. Unlike existing single-image-oriented pruning strategies, our approach dynamically evaluates token importance by modeling cross-modal (vision–language) interactions, integrating attention distribution and semantic relevance to realize two-stage progressive pruning. Evaluated across eight benchmarks, it achieves an average accuracy gain of 0.6%, reduces inference latency by 10.78%, and removes 77.8% of image tokens, significantly outperforming existing baselines. Our core contribution is the first adaptive token pruning framework explicitly designed for multimodal ICL, balancing computational efficiency with consistent performance improvement.

📝 Abstract
Modern large vision-language models (LVLMs) convert each input image into a large set of tokens, far outnumbering the text tokens. Although this improves visual perception, it introduces severe image token redundancy. Because image tokens carry sparse information, many add little to reasoning yet greatly increase inference cost. Emerging image token pruning methods tackle this issue by identifying the most important tokens and discarding the rest, raising efficiency with only modest performance loss. However, most of them consider only single-image tasks and overlook multimodal in-context learning (ICL), where redundancy is greater and efficiency is more critical. Redundant tokens weaken the advantage of multimodal ICL for rapid domain adaptation and cause unstable performance. Applying existing pruning methods in this setting leads to large accuracy drops, exposing a clear gap and the need for new techniques. Thus, we propose Contextually Adaptive Token Pruning (CATP), a training-free pruning method targeted at multimodal ICL. CATP consists of two stages that perform progressive pruning to fully account for the complex cross-modal interactions in the input sequence. After removing 77.8% of the image tokens, CATP produces an average performance gain of 0.6% over the vanilla model on four LVLMs and eight benchmarks, substantially outperforming all baselines. Meanwhile, it effectively improves efficiency by achieving an average reduction of 10.78% in inference latency. CATP enhances the practical value of multimodal ICL and lays the groundwork for future progress in interleaved image-text scenarios.
Problem

Research questions and friction points this paper is trying to address.

Redundant image tokens increase multimodal ICL inference cost
Existing pruning methods fail in multimodal ICL scenarios
Unstable performance due to sparse information in image tokens
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free pruning for multimodal ICL
Progressive pruning for cross-modal interactions
Enhances efficiency and performance simultaneously
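The two-stage progressive pruning idea above can be sketched in code. This is an illustrative approximation, not CATP's actual algorithm: the stage-one saliency proxy (token L2 norm), the stage-two cosine-similarity relevance score, and the keep ratios (chosen so that roughly 77.8% of image tokens are removed overall) are all assumptions.

```python
import numpy as np

def progressive_prune(image_tokens, text_tokens,
                      keep_stage1=0.5, keep_stage2=0.444):
    """Sketch of two-stage progressive image-token pruning.

    Stage 1 keeps image tokens with the highest intra-image saliency
    (proxied here by embedding L2 norm). Stage 2 keeps the survivors
    most semantically relevant to the text context (cosine similarity
    to the mean text embedding). 0.5 * 0.444 ~ 22.2% of tokens remain,
    mirroring a ~77.8% removal rate.
    """
    # Stage 1: intra-image saliency (hypothetical proxy: L2 norm)
    saliency = np.linalg.norm(image_tokens, axis=1)
    k1 = max(1, int(len(image_tokens) * keep_stage1))
    stage1 = image_tokens[np.argsort(saliency)[-k1:]]

    # Stage 2: cross-modal relevance to the text query
    text_query = text_tokens.mean(axis=0)
    sim = stage1 @ text_query / (
        np.linalg.norm(stage1, axis=1) * np.linalg.norm(text_query) + 1e-8
    )
    k2 = max(1, int(len(stage1) * keep_stage2))
    return stage1[np.argsort(sim)[-k2:]]
```

In a real LVLM the saliency and relevance scores would come from the model's own attention maps rather than raw embedding norms; the point of the sketch is only the progressive structure, where each stage prunes the output of the previous one.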