PromptHub: Enhancing Multi-Prompt Visual In-Context Learning with Locality-Aware Fusion, Concentration and Alignment

📅 2026-03-19
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing multi-prompt visual in-context learning methods are constrained by patch-wise fusion frameworks and model-agnostic supervision, limiting their ability to fully exploit complementary information across prompts. This work proposes the first locality-aware multi-prompt fusion paradigm, which introduces a spatial prior-guided local fusion mechanism to jointly optimize concentration, alignment, and prediction objectives. Coupled with tailored data augmentation strategies, the approach moves beyond conventional fusion limitations, enabling richer contextual modeling and stronger training signals. The method substantially outperforms current state-of-the-art approaches across three fundamental vision tasks and demonstrates strong generalization, transferability, and out-of-distribution robustness.
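
The listing includes no pseudocode, but the core fusion idea can be illustrated. Below is a minimal, hypothetical PyTorch sketch of locality-aware multi-prompt fusion: per-location weights over K prompt feature maps are scored from local neighborhoods (a spatial prior) rather than from isolated patch tokens. The class name, tensor shapes, and the 3×3 scoring convolution are our assumptions for illustration, not PromptHub's actual implementation.

```python
# Hypothetical sketch of locality-aware multi-prompt fusion; names and shapes
# are illustrative assumptions, not taken from the paper's codebase.
import torch
import torch.nn as nn


class LocalityAwareFusion(nn.Module):
    """Fuse K prompt feature maps with locality-aware per-location weights."""

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        # Scores each prompt from its k x k neighborhood (the spatial prior),
        # so a fusion weight reflects local context, not an isolated patch.
        self.score = nn.Conv2d(dim, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, prompt_feats: torch.Tensor) -> torch.Tensor:
        # prompt_feats: (B, K, C, H, W) -- features from K candidate prompts.
        B, K, C, H, W = prompt_feats.shape
        logits = self.score(prompt_feats.reshape(B * K, C, H, W))
        weights = logits.reshape(B, K, 1, H, W).softmax(dim=1)  # normalize over prompts
        return (weights * prompt_feats).sum(dim=1)              # (B, C, H, W) fused map


if __name__ == "__main__":
    fusion = LocalityAwareFusion(dim=64)
    feats = torch.randn(2, 4, 64, 14, 14)  # batch of 2 queries, 4 prompts each
    print(fusion(feats).shape)             # torch.Size([2, 64, 14, 14])
```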

📝 Abstract
Visual In-Context Learning (VICL) aims to complete vision tasks by imitating pixel demonstrations. Recent work pioneered prompt fusion, which combines the advantages of various demonstrations and offers a promising way to extend VICL. Unfortunately, the patch-wise fusion framework and model-agnostic supervision hinder the exploitation of informative cues, thereby limiting performance gains. To overcome this deficiency, we introduce PromptHub, a framework that holistically strengthens multi-prompting through locality-aware fusion, concentration and alignment. PromptHub exploits spatial priors to capture richer contextual information, employs complementary concentration, alignment, and prediction objectives that mutually guide training, and incorporates data augmentation to further reinforce supervision. Extensive experiments on three fundamental vision tasks demonstrate the superiority of PromptHub. Moreover, we validate its universality, transferability, and robustness across out-of-distribution settings and various retrieval scenarios. This work establishes a reliable locality-aware paradigm for prompt fusion, moving beyond prior patch-wise approaches. Code is available at https://github.com/luotc-why/ICLR26-PromptHub.
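
To make the three complementary training signals concrete, here is a hedged sketch of how concentration (committing fusion weights to informative prompts), alignment (matching the fused representation to a model-aware feature target), and prediction losses might be combined. The exact formulations and the weights `lambda_c` and `lambda_a` are illustrative assumptions, not the paper's definitions.

```python
# Hypothetical combination of the three objectives the abstract names;
# the concrete loss forms below are our assumptions, not the paper's.
import torch
import torch.nn.functional as F


def prompthub_style_loss(fusion_weights: torch.Tensor,  # (B, K, 1, H, W), sums to 1 over K
                         fused_feat: torch.Tensor,      # (B, C, H, W) fused representation
                         target_feat: torch.Tensor,     # (B, C, H, W) model-aware feature target
                         pred: torch.Tensor,            # (B, 3, H, W) predicted output image
                         target: torch.Tensor,          # (B, 3, H, W) ground-truth output image
                         lambda_c: float = 0.1,
                         lambda_a: float = 0.1) -> torch.Tensor:
    # Prediction: pixel-level reconstruction of the query's output.
    l_pred = F.mse_loss(pred, target)
    # Concentration: minimize entropy over prompts so each location commits
    # to its most informative demonstration instead of blurry averaging.
    p = fusion_weights.clamp_min(1e-8)
    l_conc = -(p * p.log()).sum(dim=1).mean()
    # Alignment: pull the fused representation toward a feature target drawn
    # from the model itself, rather than relying on model-agnostic supervision.
    l_align = 1 - F.cosine_similarity(fused_feat.flatten(1),
                                      target_feat.flatten(1), dim=1).mean()
    return l_pred + lambda_c * l_conc + lambda_a * l_align
```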
Problem

Research questions and friction points this paper is trying to address.

Visual In-Context Learning
prompt fusion
patch-wise fusion
model-agnostic supervision
informative cues
Innovation

Methods, ideas, or system contributions that make the work stand out.

locality-aware fusion
multi-prompt visual in-context learning
spatial priors
concentration and alignment
prompt fusion
🔎 Similar Papers
No similar papers found.
Tianci Luo
Tsinghua Shenzhen International Graduate School, Tsinghua University
Jinpeng Wang
Harbin Institute of Technology, Shenzhen
Shiyu Qin
Tsinghua Shenzhen International Graduate School, Tsinghua University
Niu Lian
Harbin Institute of Technology, Shenzhen
Yan Feng
Hangzhou Institute of Advanced Study, UCAS
Raman lasers, fiber lasers, nonlinear photonics, laser guide star, optical magnetometry
Bin Chen
Harbin Institute of Technology, Shenzhen
Chun Yuan
Graduate School at Shenzhen, Tsinghua University
Computer vision, multimedia access control
Shu-Tao Xia
SIGS, Tsinghua University
coding and information theory, machine learning, computer vision, AI security