🤖 AI Summary
Existing visual in-context learning methods often rely on a single optimal prompt or simplistic fusion strategies, which fail to fully exploit diverse contextual information and thereby limit the model's reasoning capability. To address this limitation, this work proposes a multi-combination collaborative fusion framework that constructs three distinct contextual representation branches, each derived from a different combination of high-quality prompts. Furthermore, the authors introduce a novel MULTI-VQGAN architecture designed to jointly interpret multi-source collaborative signals, enabling a multi-branch, multi-combination context fusion mechanism. Extensive experiments demonstrate that the proposed approach significantly enhances generalization, robustness, and accuracy across various tasks, including foreground segmentation, single-object detection, and image colorization, outperforming current state-of-the-art methods.
📝 Abstract
Visual In-Context Learning (VICL) has emerged as a powerful paradigm, enabling models to perform novel visual tasks by learning from in-context examples. The dominant "retrieve-then-prompt" approach typically relies on selecting the single best visual prompt, a practice that often discards valuable contextual information from other suitable candidates. While recent work has explored fusing the top-K prompts into a single, enhanced representation, this still collapses multiple rich signals into one, limiting the model's reasoning capability. We argue that a more multi-faceted, collaborative fusion is required to unlock the full potential of these diverse contexts. To address this limitation, we introduce a novel framework that moves beyond single-prompt fusion toward a multi-combination collaborative fusion. Instead of collapsing multiple prompts into one, our method generates three contextual representation branches, each formed by integrating information from a different combination of top-quality prompts. These complementary guidance signals are then fed into the proposed MULTI-VQGAN architecture, which is designed to jointly interpret and utilize collaborative information from multiple sources. Extensive experiments on diverse tasks, including foreground segmentation, single-object detection, and image colorization, highlight the framework's strong cross-task generalization, effective contextual fusion, and ability to produce more robust and accurate predictions than existing methods.
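To make the branch-construction idea concrete, the sketch below shows one plausible way to group the top-3 retrieved prompt features into three combination branches and fuse them. It is a minimal illustration only: the specific groupings, the `MultiCombinationFusion` module, the mean-pooling within each combination, and the softmax-weighted fusion are all assumptions, since the abstract does not specify how MULTI-VQGAN consumes the branches.

```python
import torch
import torch.nn as nn

class MultiCombinationFusion(nn.Module):
    """Illustrative sketch: build three context branches from top-K prompt
    features and fuse them with learned weights. The grouping scheme and the
    fusion operator are assumptions, not the paper's exact design."""

    def __init__(self, dim: int, num_branches: int = 3):
        super().__init__()
        # One projection per contextual branch.
        self.branch_proj = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(num_branches)]
        )
        # Scalar gate per branch, normalized with softmax at fusion time.
        self.branch_logits = nn.Parameter(torch.zeros(num_branches))

    def forward(self, prompt_feats: torch.Tensor) -> torch.Tensor:
        # prompt_feats: (K, dim) features of the top-K retrieved prompts (K >= 3).
        p1, p2, p3 = prompt_feats[0], prompt_feats[1], prompt_feats[2]
        # Hypothetical prompt combinations; the paper's actual groupings may differ.
        combos = [
            torch.stack([p1, p2]).mean(dim=0),
            torch.stack([p1, p3]).mean(dim=0),
            torch.stack([p1, p2, p3]).mean(dim=0),
        ]
        branches = [proj(c) for proj, c in zip(self.branch_proj, combos)]
        weights = torch.softmax(self.branch_logits, dim=0)
        # Weighted sum collapses the branches only for illustration here;
        # the paper instead feeds the branches jointly into MULTI-VQGAN.
        return sum(w * b for w, b in zip(weights, branches))


if __name__ == "__main__":
    feats = torch.randn(3, 256)      # stand-in features for the top-3 prompts
    fusion = MultiCombinationFusion(256)
    fused = fusion(feats)
    print(fused.shape)               # torch.Size([256])
```

In the paper's framework the three branches are kept separate and interpreted jointly by the MULTI-VQGAN decoder; the weighted sum above merely stands in for that joint-fusion step in a self-contained example.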