Enhancing Multimodal In-Context Learning for Image Classification through Coreset Optimization

📅 2025-04-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large image support sets make multimodal in-context learning (ICL) computationally expensive, redundant, and memory-hungry. To address this, the paper proposes KeCO, a framework built on a key-guided, coreset-based dynamic evolution mechanism in which visual features serve as keys that steer adaptive coreset refinement. Unselected samples continuously update the key-value representations, so the support set becomes more informative while staying lightweight and cheap to build. The method combines visual feature extraction, key-conditioned sampling, iterative coreset optimization, and multi-granularity evaluation. On both coarse- and fine-grained image classification benchmarks, KeCO improves ICL accuracy by over 20% on average. Online simulation experiments report substantial efficiency gains: a 38% reduction in GPU memory usage and a 41% decrease in inference latency.

📝 Abstract

In-context learning (ICL) enables Large Vision-Language Models (LVLMs) to adapt to new tasks without parameter updates, using a few demonstrations from a large support set. However, selecting informative demonstrations leads to high computational and memory costs. While some methods explore selecting a small and representative coreset for text classification, evaluating all support set samples remains costly, and discarded samples lead to unnecessary information loss. These methods may also be less effective for image classification due to differences in feature spaces. Given these limitations, we propose Key-based Coreset Optimization (KeCO), a novel framework that leverages untapped data to construct a compact and informative coreset. We introduce visual features as keys within the coreset, which serve as anchors for identifying samples to be updated through different selection strategies. By leveraging untapped samples from the support set, we update the keys of selected coreset samples, enabling the randomly initialized coreset to evolve into a more informative coreset at low computational cost. Through extensive experiments on coarse-grained and fine-grained image classification benchmarks, we demonstrate that KeCO effectively enhances ICL performance on image classification tasks, achieving an average improvement of more than 20%. Notably, we evaluate KeCO under a simulated online scenario, and the strong performance in this scenario highlights the practical value of our framework for resource-constrained real-world scenarios.
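The key-update idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the update rule or the selection strategies, so everything specific here is an assumption for illustration only: the function name `evolve_coreset`, same-class nearest-key selection by cosine similarity, and the moving-average update with a `momentum` parameter are hypothetical choices, not the authors' implementation.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors to unit length along the last axis."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def evolve_coreset(features, labels, coreset_size, momentum=0.9, seed=0):
    """Hypothetical sketch of key-based coreset evolution.

    features: (n, d) array of visual features for the support set.
    labels:   (n,) array of integer class labels.
    Returns the coreset indices, their evolved keys, and their labels.
    """
    rng = np.random.default_rng(seed)
    feats = l2_normalize(np.asarray(features, dtype=np.float64))
    n = feats.shape[0]

    # Randomly initialized coreset; each member's key starts as its own feature.
    core_idx = rng.choice(n, size=coreset_size, replace=False)
    keys = feats[core_idx].copy()
    core_labels = labels[core_idx]

    # Untapped samples are the support samples left out of the coreset.
    untapped = np.setdiff1d(np.arange(n), core_idx)
    for i in untapped:
        # Assumed selection strategy: the most similar same-class coreset key.
        same_class = np.flatnonzero(core_labels == labels[i])
        if same_class.size == 0:
            continue  # no coreset member of this class to update
        sims = keys[same_class] @ feats[i]
        target = same_class[np.argmax(sims)]
        # Assumed update rule: moving average of the key toward the sample.
        mixed = momentum * keys[target] + (1.0 - momentum) * feats[i]
        keys[target] = l2_normalize(mixed)

    return core_idx, keys, core_labels
```

Under these assumptions, each untapped sample is visited once and costs one similarity computation against the same-class keys, which is one way the evolution could stay cheap relative to scoring every support sample as a candidate demonstration.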

Problem

Research questions and friction points this paper is trying to address.

Optimizing coreset selection for image classification efficiency

Reducing computational costs in multimodal in-context learning

Enhancing LVLM performance with minimal information loss

Innovation

Methods, ideas, or system contributions that make the work stand out.

Key-based Coreset Optimization (KeCO) framework

Visual features as keys for coreset

Low-cost evolution of informative coreset
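Once the coreset has keys, one plausible way to use them at inference time is to retrieve the demonstrations whose keys are closest to the query image's feature. The sketch below assumes top-k cosine similarity retrieval; the function name `select_demonstrations` and the parameter `k` are hypothetical, and the source does not confirm that KeCO retrieves demonstrations this way.

```python
import numpy as np


def select_demonstrations(query_feature, keys, k=4):
    """Return indices and similarities of the k coreset keys closest to the query.

    query_feature: (d,) visual feature of the query image.
    keys:          (m, d) keys of the coreset samples.
    """
    eps = 1e-12
    q = query_feature / (np.linalg.norm(query_feature) + eps)
    normed_keys = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + eps)
    sims = normed_keys @ q           # cosine similarity to every key
    top = np.argsort(-sims)[:k]      # indices sorted by descending similarity
    return top, sims[top]
```

The returned indices point into the coreset, so the corresponding image-label pairs could then be placed in the LVLM prompt as demonstrations.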

🔎 Similar Papers

Harnessing Shared Relations via Multimodal Mixup Contrastive Learning for Multimodal Classification