Beyond Single Prompts: Synergistic Fusion and Arrangement for VICL

📅 2026-01-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two limitations of existing Vision In-Context Learning (VICL) methods: they rely solely on the single most similar prompt, overlooking complementary information in other high-quality prompts, and they fail to exploit the structured layout information implied by different prompt arrangements. To overcome these issues, the authors propose an end-to-end VICL framework in which a dedicated fusion module adaptively fuses key patterns and annotations from multiple prompts. Arrangement-specific lightweight MLPs decouple layout priors from the core model, and a bidirectional fine-tuning mechanism swaps the roles of query and prompt so that the fusion module and the inpainting model are optimized jointly. The authors report superior results on foreground segmentation, single-object detection, and image colorization, along with strong cross-task generalization.

📝 Abstract

Vision In-Context Learning (VICL) enables inpainting models to quickly adapt to new visual tasks from only a few prompts. However, existing methods suffer from two key issues: (1) they select only the most similar prompt, discarding complementary cues from other high-quality prompts; and (2) they fail to exploit the structured information implied by different prompt arrangements. We propose an end-to-end VICL framework to overcome these limitations. First, an adaptive Fusion Module aggregates critical patterns and annotations from multiple prompts to form more precise contextual prompts. Second, we introduce arrangement-specific lightweight MLPs that decouple layout priors from the core model while minimally affecting the overall model. Third, a bidirectional fine-tuning mechanism swaps the roles of query and prompt, encouraging the model to reconstruct the original prompt from the fused context and thus enhancing collaboration between the fusion module and the inpainting model. Experiments on foreground segmentation, single-object detection, and image colorization demonstrate superior results and strong cross-task generalization of our method.
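
To make the idea of fusing several prompts (rather than keeping only the most similar one) concrete, here is a minimal sketch in plain Python. It is not the paper's Fusion Module, which is learned end to end; it is a fixed stand-in that weights each prompt by a softmax over its cosine similarity to the query. The function names (`cosine`, `softmax`, `fuse_prompts`), the flat feature vectors, and the `temperature` parameter are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def softmax(scores, temperature=1.0):
    """Numerically stable softmax; a lower temperature gives sharper weights."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_prompts(query_feat, prompt_feats, prompt_labels, temperature=0.1):
    """Hypothetical fusion: similarity-weighted average of prompt features and labels.

    Assumption: each prompt is a flat feature vector plus a flat label vector.
    The paper's module is learned; this fixed rule only illustrates the idea
    that several prompts contribute instead of a single top-1 prompt.
    """
    sims = [cosine(query_feat, p) for p in prompt_feats]
    weights = softmax(sims, temperature)
    fused_feat = [
        sum(w * p[i] for w, p in zip(weights, prompt_feats))
        for i in range(len(prompt_feats[0]))
    ]
    fused_label = [
        sum(w * lab[i] for w, lab in zip(weights, prompt_labels))
        for i in range(len(prompt_labels[0]))
    ]
    return fused_feat, fused_label, weights


# Toy usage: two prompts resemble the query, one does not.
query = [1.0, 0.0]
prompts = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
labels = [[1.0], [0.0], [1.0]]
fused_feat, fused_label, weights = fuse_prompts(query, prompts, labels)
```

In this toy case the two prompts that resemble the query share almost all of the weight, while the dissimilar prompt contributes almost nothing. A top-1 selection would have discarded the second similar prompt entirely, which is the limitation the abstract describes.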

Problem

Research questions and friction points this paper is trying to address.

Vision In-Context Learning

prompt fusion

prompt arrangement

inpainting

contextual prompts

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision In-Context Learning

Prompt Fusion

Arrangement-Specific MLPs