CATP: Contextually Adaptive Token Pruning for Efficient and Enhanced Multimodal In-Context Learning

📅 2025-08-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
In multimodal in-context learning (ICL), excessive image tokens lead to inefficient inference and unstable performance. To address this, we propose a training-free, progressive adaptive pruning method tailored to multimodal ICL. Unlike existing single-image-oriented pruning strategies, our approach dynamically evaluates token importance by modeling cross-modal (vision–language) interactions, integrating attention distribution and semantic relevance to realize two-stage progressive pruning. Evaluated across eight benchmarks, it achieves an average accuracy gain of 0.6%, reduces inference latency by 10.78%, and removes 77.8% of image tokens, significantly outperforming existing baselines. Our core contribution is the first adaptive token pruning framework explicitly designed for multimodal ICL, balancing computational efficiency with consistent performance improvement.

📝 Abstract
Modern large vision-language models (LVLMs) convert each input image into a large set of tokens, far outnumbering the text tokens. Although this improves visual perception, it introduces severe image token redundancy. Because image tokens carry sparse information, many add little to reasoning yet greatly increase inference cost. Emerging image token pruning methods tackle this issue by identifying the most important tokens and discarding the rest, raising efficiency with only modest performance loss. However, most of them consider only single-image tasks and overlook multimodal in-context learning (ICL), where redundancy is greater and efficiency is more critical. Redundant tokens weaken the advantage of multimodal ICL for rapid domain adaptation and cause unstable performance. Applying existing pruning methods in this setting leads to large accuracy drops, exposing a clear gap and the need for new techniques. Thus, we propose Contextually Adaptive Token Pruning (CATP), a training-free pruning method targeted at multimodal ICL. CATP consists of two stages that perform progressive pruning to fully account for the complex cross-modal interactions in the input sequence. After removing 77.8% of the image tokens, CATP produces an average performance gain of 0.6% over the vanilla model on four LVLMs and eight benchmarks, substantially outperforming all baselines. Meanwhile, it effectively improves efficiency by achieving an average reduction of 10.78% in inference latency. CATP enhances the practical value of multimodal ICL and lays the groundwork for future progress in interleaved image-text scenarios.
Problem

Research questions and friction points this paper is trying to address.

Redundant image tokens increase multimodal ICL inference cost
Existing pruning methods fail in multimodal ICL scenarios
Unstable performance due to sparse information in image tokens
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free pruning for multimodal ICL
Progressive pruning for cross-modal interactions
Enhances efficiency and performance simultaneously
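The two-stage progressive pruning idea above can be sketched in code. This is an illustrative approximation, not CATP's actual algorithm: the stage-one saliency proxy (token L2 norm), the stage-two cosine-similarity relevance score, and the keep ratios (chosen so that roughly 77.8% of image tokens are removed overall) are all assumptions.

```python
import numpy as np

def progressive_prune(image_tokens, text_tokens,
                      keep_stage1=0.5, keep_stage2=0.444):
    """Sketch of two-stage progressive image-token pruning.

    Stage 1 keeps image tokens with the highest intra-image saliency
    (proxied here by embedding L2 norm). Stage 2 keeps the survivors
    most semantically relevant to the text context (cosine similarity
    to the mean text embedding). 0.5 * 0.444 ~ 22.2% of tokens remain,
    mirroring a ~77.8% removal rate.
    """
    # Stage 1: intra-image saliency (hypothetical proxy: L2 norm)
    saliency = np.linalg.norm(image_tokens, axis=1)
    k1 = max(1, int(len(image_tokens) * keep_stage1))
    stage1 = image_tokens[np.argsort(saliency)[-k1:]]

    # Stage 2: cross-modal relevance to the text query
    text_query = text_tokens.mean(axis=0)
    sim = stage1 @ text_query / (
        np.linalg.norm(stage1, axis=1) * np.linalg.norm(text_query) + 1e-8
    )
    k2 = max(1, int(len(stage1) * keep_stage2))
    return stage1[np.argsort(sim)[-k2:]]
```

In a real LVLM the saliency and relevance scores would come from the model's own attention maps rather than raw embedding norms; the point of the sketch is only the progressive structure, where each stage prunes the output of the previous one.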