Enhancing Visual In-Context Learning by Multi-Faceted Fusion

📅 2026-01-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing visual in-context learning methods often rely on a single optimal prompt or simplistic fusion strategies, which fail to fully exploit diverse contextual information and thereby limit model reasoning capabilities. To address this limitation, this work proposes a multi-combination collaborative fusion framework that constructs three distinct contextual representation branches, each derived from different high-quality prompt combinations. Furthermore, the authors introduce a novel MULTI-VQGAN architecture designed to jointly parse multi-source collaborative signals, enabling a sophisticated multi-branch, multi-combination context fusion mechanism. Extensive experiments demonstrate that the proposed approach significantly enhances generalization, robustness, and accuracy across various tasks—including foreground segmentation, single-object detection, and image colorization—outperforming current state-of-the-art methods.

📝 Abstract
Visual In-Context Learning (VICL) has emerged as a powerful paradigm, enabling models to perform novel visual tasks by learning from in-context examples. The dominant "retrieve-then-prompt" approach typically relies on selecting the single best visual prompt, a practice that often discards valuable contextual information from other suitable candidates. While recent work has explored fusing the top-K prompts into a single, enhanced representation, this still collapses multiple rich signals into one, limiting the model's reasoning capability. We argue that a more multi-faceted, collaborative fusion is required to unlock the full potential of these diverse contexts. To address this limitation, we introduce a novel framework that moves beyond single-prompt fusion towards a multi-combination collaborative fusion. Instead of collapsing multiple prompts into one, our method generates three contextual representation branches, each formed by integrating information from a different combination of top-quality prompts. These complementary guidance signals are then fed into the proposed MULTI-VQGAN architecture, which is designed to jointly interpret and utilize collaborative information from multiple sources. Extensive experiments on diverse tasks, including foreground segmentation, single-object detection, and image colorization, highlight the framework's strong cross-task generalization, effective contextual fusion, and ability to produce more robust and accurate predictions than existing methods.
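
To make the branching idea concrete, the following is a minimal PyTorch sketch of the multi-combination step described above. The module name MultiCombinationFusion, the particular combination scheme (best-only, best pair, all top-K), and the mean-then-project fusion are all illustrative assumptions; the paper's actual MULTI-VQGAN conditioning is not reproduced here.

```python
import torch
import torch.nn as nn


class MultiCombinationFusion(nn.Module):
    """Hypothetical sketch of the three-branch prompt fusion.

    The combination scheme (best-only / best pair / all top-K) and the
    mean-then-project fusion are illustrative assumptions, not the
    paper's specification.
    """

    def __init__(self, dim: int = 256, top_k: int = 3):
        super().__init__()
        self.top_k = top_k
        # One lightweight projection head per contextual branch (assumed).
        self.branch_heads = nn.ModuleList(nn.Linear(dim, dim) for _ in range(3))

    def forward(self, query: torch.Tensor, prompt_bank: torch.Tensor) -> torch.Tensor:
        # query: (dim,) embedding of the query image.
        # prompt_bank: (N, dim) embeddings of candidate visual prompts.
        sims = torch.cosine_similarity(prompt_bank, query.unsqueeze(0), dim=-1)
        top = prompt_bank[sims.topk(self.top_k).indices]  # (K, dim)

        # Three different combinations of the top-K prompts.
        combos = [top[:1], top[:2], top]
        branches = [
            head(c.mean(dim=0))  # fuse each combination by mean pooling
            for head, c in zip(self.branch_heads, combos)
        ]
        # A full system would feed these three complementary branch
        # embeddings jointly into the MULTI-VQGAN decoder (not shown).
        return torch.stack(branches)  # (3, dim)


if __name__ == "__main__":
    fusion = MultiCombinationFusion(dim=256, top_k=3)
    query = torch.randn(256)
    bank = torch.randn(10, 256)
    print(fusion(query, bank).shape)  # torch.Size([3, 256])
```

In a complete system, the three stacked branch embeddings would condition the decoder jointly rather than being returned directly; the sketch only illustrates how distinct prompt combinations yield complementary guidance signals.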
Problem

Research questions and friction points this paper is trying to address.

Visual In-Context Learning
prompt fusion
contextual representation
multi-prompt integration
visual prompting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-faceted Fusion
Visual In-Context Learning
Collaborative Representation
MULTI-VQGAN
Top-K Prompt Integration
👥 Authors
Wenwen Liao
College of Intelligent Robotics and Advanced Manufacturing, Fudan University
Jianbo Yu
Professor of School of Mechanical Engineering, Tongji University
Prognostics and Health Management, Condition-Based Monitoring, Quality Control, Fault Diagnosis, Industrial Engineering
Yuansong Wang
Tsinghua Shenzhen International Graduate School, Tsinghua University
Qingchao Jiang
School of Information Science and Engineering, East China University of Science and Technology
Xiaofeng Yang
School of Microelectronics, Fudan University