Love Me, Love My Label: Rethinking the Role of Labels in Prompt Retrieval for Visual In-Context Learning

📅 2026-04-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a critical limitation in existing visual in-context learning methods, where prompt retrieval ignores label information and often retrieves visually similar but semantically inconsistent examples, degrading performance. To overcome this, the authors propose LaPR, a novel framework that explicitly integrates joint image-label representations into prompt retrieval for the first time. LaPR introduces a query-adaptive mixture-of-experts routing mechanism to enable label-aware representations even for unlabeled queries. By combining a dual-encoder architecture, contrastive learning, and an alternating optimization strategy, the method consistently improves performance across diverse tasks—including segmentation, detection, and colorization—and demonstrates strong generalization across different feature extractors and cross-fold settings.
📝 Abstract
Visual in-context learning (VICL) enables visual foundation models to handle multiple tasks by steering them with demonstrative prompts. The choice of such prompts largely influences VICL performance, standing out as a key challenge. Prior work has made substantial progress on prompt retrieval and reranking strategies, but mainly focuses on prompt images while overlooking labels. We reveal that these approaches sometimes retrieve visually similar but label-inconsistent prompts, which can degrade VICL performance. Conversely, higher label consistency between query and prompts tends to indicate stronger VICL results. Motivated by these findings, we develop a framework named LaPR (Label-aware Prompt Retrieval), which highlights the role of labels in prompt selection. Our framework first designs a joint image-label representation for prompts to incorporate label cues explicitly. Besides, to handle query labels that are unavailable at test time, we introduce a mixture-of-experts mechanism into the dual encoders with query-adaptive routing. Each expert is expected to capture a specific label mode, while the router infers query-adaptive mixture weights and helps to learn label-aware representations. We carefully design an alternating optimization for experts and router, with a VICL performance-guided contrastive loss and a label-guided contrastive loss, respectively. Extensive experiments show promising and consistent improvements of LaPR on in-context segmentation, detection, and colorization tasks. Moreover, LaPR generalizes well across feature extractors and cross-fold scenarios, suggesting the importance of label utilization in prompt retrieval for VICL. Code is available at https://github.com/luotc-why/CVPR26-LaPR.
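The retrieval pipeline the abstract describes, a router that infers query-adaptive mixture weights over experts so an unlabeled query can still get a label-aware embedding, can be sketched roughly as follows. All dimensions, the linear experts, and the random weights here are hypothetical placeholders (in LaPR these would be learned); this is a minimal illustration of the routing idea, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 64, 4   # hypothetical feature dim and number of experts

# Hypothetical experts: each linear map stands in for a module meant to
# capture one label mode; the router maps a query feature to gate logits.
experts = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(K)]
router_w = rng.standard_normal((D, K)) / np.sqrt(D)

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def label_aware_query_embedding(q_feat):
    """Mix expert outputs with query-adaptive gate weights (no label needed)."""
    gate = softmax(q_feat @ router_w)                     # mixture weights, sum to 1
    z = sum(g * (q_feat @ W) for g, W in zip(gate, experts))
    return z / np.linalg.norm(z)

def retrieve(q_feat, prompt_embs, topk=2):
    """Rank candidate prompts (joint image-label embeddings) by cosine sim."""
    q = label_aware_query_embedding(q_feat)
    sims = prompt_embs @ q                                # prompts pre-normalized
    return np.argsort(-sims)[:topk]

# Toy usage: 10 candidate prompt embeddings, one unlabeled query.
prompts = rng.standard_normal((10, D))
prompts /= np.linalg.norm(prompts, axis=1, keepdims=True)
query = rng.standard_normal(D)
print(retrieve(query, prompts))
```

The key design point mirrored here is that label awareness lives in the learned experts, so at test time the query side never needs a ground-truth label, only the router's soft assignment over label modes.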
Problem

Research questions and friction points this paper is trying to address.

visual in-context learning
prompt retrieval
label consistency
foundation models
demonstrative prompts
Innovation

Methods, ideas, or system contributions that make the work stand out.

label-aware prompt retrieval
visual in-context learning
mixture-of-experts
joint image-label representation
contrastive loss
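The two contrastive objectives listed above (VICL performance-guided for the experts, label-guided for the router) are trained in alternation. A generic InfoNCE-style loss gives the flavor of both; the function below is a hypothetical minimal sketch, not LaPR's exact formulation, and the temperature value is an assumption.

```python
import numpy as np

def info_nce(query_emb, prompt_embs, pos_idx, tau=0.07):
    """Generic InfoNCE sketch: pull the positive prompt toward the query
    embedding and push the other candidates away."""
    sims = prompt_embs @ query_emb / tau
    sims -= sims.max()                              # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum())   # log-softmax over candidates
    return -log_probs[pos_idx]
```

In an alternating scheme like the one the abstract describes, the expert branch would pick `pos_idx` as the prompt that yields the best downstream VICL result, while the router branch would pick it by label similarity to the query.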
👥 Authors

Tianci Luo · Tsinghua Shenzhen International Graduate School, Tsinghua University
Haohao Pan · School of Computer Science and Engineering, Northeastern University
Jinpeng Wang · Harbin Institute of Technology, Shenzhen
Niu Lian · Harbin Institute of Technology, Shenzhen
Xinrui Chen · Tsinghua University (Efficient Deep Learning, Computer Vision)
Bin Chen · Harbin Institute of Technology, Shenzhen
Shu-Tao Xia · SIGS, Tsinghua University (coding and information theory, machine learning, computer vision, AI security)
Chun Yuan · Graduate School at Shenzhen, Tsinghua University (Computer vision, multimedia access control)