🤖 AI Summary
Existing visual in-context learning methods often rely on a single optimal prompt or simplistic fusion strategies, which fail to fully exploit diverse contextual information and thereby limit the model's reasoning capability. To address this limitation, this work proposes a multi-combination collaborative fusion framework that constructs three distinct contextual representation branches, each derived from a different combination of high-quality prompts. Furthermore, the authors introduce a novel MULTI-VQGAN architecture designed to jointly interpret multi-source collaborative signals, enabling a multi-branch, multi-combination context fusion mechanism. Extensive experiments demonstrate that the proposed approach significantly enhances generalization, robustness, and accuracy across various tasks, including foreground segmentation, single-object detection, and image colorization, outperforming current state-of-the-art methods.
📝 Abstract
Visual In-Context Learning (VICL) has emerged as a powerful paradigm, enabling models to perform novel visual tasks by learning from in-context examples. The dominant "retrieve-then-prompt" approach typically relies on selecting the single best visual prompt, a practice that often discards valuable contextual information from other suitable candidates. While recent work has explored fusing the top-K prompts into a single, enhanced representation, this still collapses multiple rich signals into one, limiting the model's reasoning capability. We argue that a more multi-faceted, collaborative fusion is required to unlock the full potential of these diverse contexts. To address this limitation, we introduce a novel framework that moves beyond single-prompt fusion toward a multi-combination collaborative fusion. Instead of collapsing multiple prompts into one, our method generates three contextual representation branches, each formed by integrating information from a different combination of top-quality prompts. These complementary guidance signals are then fed into the proposed MULTI-VQGAN architecture, which is designed to jointly interpret and utilize collaborative information from multiple sources. Extensive experiments on diverse tasks, including foreground segmentation, single-object detection, and image colorization, highlight the framework's strong cross-task generalization, effective contextual fusion, and ability to produce more robust and accurate predictions than existing methods.
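To make the branch-construction idea concrete, the sketch below shows one plausible way to group the top-3 retrieved prompt features into three combination branches and fuse them. It is a minimal illustration only: the specific groupings, the `MultiCombinationFusion` module, the mean-pooling within each combination, and the softmax-weighted fusion are all assumptions, since the abstract does not specify how MULTI-VQGAN consumes the branches.

```python
import torch
import torch.nn as nn

class MultiCombinationFusion(nn.Module):
    """Illustrative sketch: build three context branches from top-K prompt
    features and fuse them with learned weights. The grouping scheme and the
    fusion operator are assumptions, not the paper's exact design."""

    def __init__(self, dim: int, num_branches: int = 3):
        super().__init__()
        # One projection per contextual branch.
        self.branch_proj = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(num_branches)]
        )
        # Scalar gate per branch, normalized with softmax at fusion time.
        self.branch_logits = nn.Parameter(torch.zeros(num_branches))

    def forward(self, prompt_feats: torch.Tensor) -> torch.Tensor:
        # prompt_feats: (K, dim) features of the top-K retrieved prompts (K >= 3).
        p1, p2, p3 = prompt_feats[0], prompt_feats[1], prompt_feats[2]
        # Hypothetical prompt combinations; the paper's actual groupings may differ.
        combos = [
            torch.stack([p1, p2]).mean(dim=0),
            torch.stack([p1, p3]).mean(dim=0),
            torch.stack([p1, p2, p3]).mean(dim=0),
        ]
        branches = [proj(c) for proj, c in zip(self.branch_proj, combos)]
        weights = torch.softmax(self.branch_logits, dim=0)
        # Weighted sum collapses the branches only for illustration here;
        # the paper instead feeds the branches jointly into MULTI-VQGAN.
        return sum(w * b for w, b in zip(weights, branches))


if __name__ == "__main__":
    feats = torch.randn(3, 256)      # stand-in features for the top-3 prompts
    fusion = MultiCombinationFusion(256)
    fused = fusion(feats)
    print(fused.shape)               # torch.Size([256])
```

In the paper's framework the three branches are kept separate and interpreted jointly by the MULTI-VQGAN decoder; the weighted sum above merely stands in for that joint-fusion step in a self-contained example.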