Provoking Multi-modal Few-Shot LVLM via Exploration-Exploitation In-Context Learning

📅 2025-06-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal in-context learning (ICL) faces two key challenges: (1) predefined or heuristic example selection lacks adaptability across diverse tasks, and (2) independent sample selection ignores inter-example dependencies, leading to redundancy and suboptimal performance. To address these, we propose the first exploration-exploitation reinforcement learning (RL) framework tailored for multimodal ICL. Our method jointly models vision-language modality fusion and demonstration sample co-selection in an end-to-end manner, enabling large vision-language models (LVLMs) to autonomously evolve optimal prompting strategies. We introduce an adaptive demonstration selection mechanism coupled with multimodal RL optimization—where rewards incorporate cross-modal alignment and task-specific accuracy. Evaluated on four VQA benchmarks, our approach significantly outperforms hand-crafted and heuristic baselines. Ablations confirm that explicit modeling of cross-modal interactions is critical for few-shot generalization. This work establishes a principled RL-based paradigm for adaptive, context-aware multimodal prompting.
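The adaptive selection mechanism described above can be illustrated with a minimal epsilon-greedy sketch: demonstrations are drawn from a shrinking pool (so the k-shot prompt is assembled as a whole rather than by k independent draws), and the reward mixes task accuracy with a cross-modal alignment score. The function names, the score dictionary, and the fixed mixing weight `lam` are illustrative assumptions for this sketch; the paper itself learns the selection policy end-to-end with RL rather than using a hand-set epsilon.

```python
import random

def select_demonstrations(candidates, scores, k, epsilon=0.2, rng=random):
    """Epsilon-greedy demonstration co-selection (illustrative sketch).

    With probability epsilon we explore a random candidate; otherwise we
    exploit the highest-scoring one. Picked examples leave the pool, so
    later picks depend on earlier ones instead of being independent.
    """
    pool = list(candidates)
    selected = []
    for _ in range(min(k, len(pool))):
        if rng.random() < epsilon:
            idx = rng.randrange(len(pool))  # explore: random candidate
        else:
            # exploit: best remaining candidate under the current scores
            idx = max(range(len(pool)), key=lambda i: scores[pool[i]])
        selected.append(pool.pop(idx))
    return selected

def reward(accuracy, alignment, lam=0.5):
    """Scalar reward mixing task accuracy with cross-modal alignment."""
    return (1 - lam) * accuracy + lam * alignment
```

With `epsilon=0` the loop reduces to greedy top-k selection; raising epsilon trades exploitation for exploration, which is the knob the RL framework effectively tunes through its learned policy.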

📝 Abstract
In-context learning (ICL), a predominant trend in instruction learning, aims to enhance the performance of large language models by providing clear task guidance and examples, improving their capability in task understanding and execution. This paper investigates ICL on Large Vision-Language Models (LVLMs) and explores policies for multi-modal demonstration selection. Existing research efforts in ICL face significant challenges: first, they rely on pre-defined demonstrations or heuristic selection strategies based on human intuition, which are usually inadequate for covering diverse task requirements, leading to sub-optimal solutions; second, selecting each demonstration individually fails to model the interactions between them, resulting in information redundancy. Unlike these prevailing efforts, we propose a new exploration-exploitation reinforcement learning framework, which explores policies to fuse multi-modal information and adaptively select adequate demonstrations as an integrated whole. The framework allows LVLMs to optimize themselves by continually refining their demonstrations through self-exploration, enabling them to autonomously identify and generate the most effective selection policies for in-context learning. Experimental results verify the superior performance of our approach on four Visual Question-Answering (VQA) datasets, demonstrating its effectiveness in enhancing the generalization capability of few-shot LVLMs.
Problem

Research questions and friction points this paper is trying to address.

Optimizing multi-modal demonstration selection for LVLMs
Reducing information redundancy in in-context learning
Enhancing few-shot LVLM generalization via self-exploration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Exploration-exploitation RL for multi-modal ICL
Adaptive selection of integrated demonstrations
Self-refining LVLM policies for VQA
Cheng Chen
State Key Laboratory of Virtual Reality Technology and Systems
Yunpeng Zhai
Alibaba Group; Peking University
LLM, Reinforcement Learning, Multi-agent System, Computer Vision
Yifan Zhao
State Key Laboratory of Virtual Reality Technology and Systems
Jinyang Gao
Alibaba Group
Machine Learning, Learning Systems
Bolin Ding
Alibaba Group
Databases, Data Privacy, Machine Learning
Jia Li
State Key Laboratory of Virtual Reality Technology and Systems