Learning to Select Visual In-Context Demonstrations

📅 2026-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing kNN-based methods for visual in-context example selection often suffer from redundancy and insufficient coverage on complex factual regression tasks. This work proposes LSD, the first approach to introduce reinforcement learning to this problem, formulating example selection as a sequential decision-making process. LSD optimizes a selection policy using a Dueling DQN architecture paired with a query-centric Transformer decoder. Extensive experiments across five visual regression benchmarks demonstrate that LSD significantly outperforms kNN and other baselines on objective factual regression tasks while remaining competitive on subjective preference tasks. The results further reveal that the nature of the regression task, objective versus subjective, plays a critical role in shaping effective example selection strategies.
📝 Abstract
Multimodal Large Language Models (MLLMs) adapt to visual tasks via in-context learning (ICL), which relies heavily on demonstration quality. The dominant demonstration selection strategy is unsupervised k-Nearest Neighbor (kNN) search. While simple, this similarity-first approach is sub-optimal for complex factual regression tasks; it selects redundant examples that fail to capture the task's full output range. We reframe selection as a sequential decision-making problem and introduce Learning to Select Demonstrations (LSD), training a Reinforcement Learning agent to construct optimal demonstration sets. Using a Dueling DQN with a query-centric Transformer Decoder, our agent learns a policy that maximizes MLLM downstream performance. Evaluating across five visual regression benchmarks, we uncover a crucial dichotomy: while kNN remains optimal for subjective preference tasks, LSD significantly outperforms baselines on objective, factual regression tasks. By balancing visual relevance with diversity, LSD better defines regression boundaries, illuminating when learned selection is strictly necessary for visual ICL.
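The abstract's core mechanics can be made concrete with a small sketch: the Dueling DQN aggregation Q(s, a) = V(s) + A(s, a) - mean_a A(s, a), driving a greedy sequential loop that builds the demonstration set one example at a time. This is an illustrative assumption-laden sketch, not the paper's implementation: the real agent would re-score candidates from query-centric Transformer features after each step, whereas here `value` and `advantages` are fixed stand-in scores and the function names are hypothetical.

```python
def dueling_q(value, advantages):
    """Dueling DQN aggregation: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a).

    `value` is the scalar state value V(s); `advantages` holds one A(s, a)
    per candidate demonstration.
    """
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]

def select_demonstrations(value, advantages, k):
    """Greedily build a k-example demonstration set, one action per step,
    masking out already-selected candidates (the sequential decision view).

    Simplification: Q-values are computed once; the paper's agent would
    condition each step on the query and the partially built set.
    """
    q = dueling_q(value, advantages)
    chosen = []
    for _ in range(k):
        best = max((i for i in range(len(q)) if i not in chosen),
                   key=lambda i: q[i])
        chosen.append(best)
    return chosen

# Example: four candidates, pick two; the two highest-advantage
# candidates (indices 1 and 3) are selected in order.
print(select_demonstrations(0.5, [0.2, 0.9, 0.1, 0.7], 2))  # → [1, 3]
```

Subtracting the mean advantage is the standard identifiability trick in dueling networks: it forces the value and advantage streams into unique roles, which tends to stabilize learning when many actions (candidate demonstrations) have similar worth.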
Problem

Research questions and friction points this paper is trying to address.

in-context learning
demonstration selection
visual regression
multimodal large language models
factual regression tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

in-context learning
demonstration selection
reinforcement learning
visual regression
multimodal large language models
Eugene Lee, University of Cincinnati (Machine Learning)
Yu-Chi Lin, University of California, Los Angeles
Jiajie Diao, University of Cincinnati