Beyond Single Prompts: Synergistic Fusion and Arrangement for VICL

📅 2026-01-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two limitations of existing Vision In-Context Learning (VICL) methods: they rely solely on the single most similar prompt, overlooking complementary information in other high-quality prompts, and they do not model the structured layout priors implied by different prompt arrangements. To overcome these issues, the authors propose an end-to-end VICL framework in which a dedicated fusion module adaptively fuses key patterns and annotations from multiple prompts. Arrangement-specific lightweight MLPs decouple layout priors from the core model, and a bidirectional fine-tuning mechanism jointly optimizes the fusion module and the inpainting model. The authors report superior results on foreground segmentation, single-object detection, and image colorization, along with strong cross-task generalization.
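To make the multi-prompt idea concrete, here is a minimal sketch in plain Python, assuming the query and the candidate prompts are already encoded as non-zero feature vectors. It is not the paper's learned fusion module: it swaps in a fixed, similarity-weighted average over the top-k prompts. The names `fuse_prompts`, `k`, and `temperature` are illustrative and do not come from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_prompts(query, prompts, k=3, temperature=0.1):
    """Blend the k prompts most similar to the query (hypothetical stand-in).

    Returns the fused feature vector, the indices of the selected prompts
    (most similar first), and the weight given to each selected prompt.
    """
    sims = [cosine(query, p) for p in prompts]
    top = sorted(range(len(prompts)), key=lambda i: sims[i], reverse=True)[:k]
    weights = softmax([sims[i] for i in top], temperature)
    fused = [
        sum(w * prompts[i][d] for w, i in zip(weights, top))
        for d in range(len(query))
    ]
    return fused, top, weights


if __name__ == "__main__":
    query = [1.0, 0.0]
    prompts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
    fused, top, weights = fuse_prompts(query, prompts, k=2)
    print("selected prompts:", top)
    print("weights:", weights)
    print("fused feature:", fused)
```

With `k=1` the function reduces to the single-most-similar-prompt baseline that the summary criticizes, which makes the contrast with multi-prompt fusion easy to see.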

📝 Abstract
Vision In-Context Learning (VICL) enables inpainting models to quickly adapt to new visual tasks from only a few prompts. However, existing methods suffer from two key issues: (1) they select only the most similar prompt, discarding complementary cues from other high-quality prompts; and (2) they fail to exploit the structured information implied by different prompt arrangements. We propose an end-to-end VICL framework to overcome these limitations. First, an adaptive Fusion Module aggregates critical patterns and annotations from multiple prompts to form more precise contextual prompts. Second, we introduce arrangement-specific lightweight MLPs that decouple layout priors from the core model while minimally affecting it. In addition, a bidirectional fine-tuning mechanism swaps the roles of query and prompt, encouraging the model to reconstruct the original prompt from the fused context and thus enhancing collaboration between the fusion module and the inpainting model. Experiments on foreground segmentation, single-object detection, and image colorization demonstrate superior results and strong cross-task generalization of our method.
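The arrangement-specific MLPs can be pictured with a small sketch in plain Python. Everything here is an assumption made for illustration: the four arrangement names (the query occupying one cell of a 2x2 canvas), the residual form of the adapter, and the zero initialization of the output layer, which makes each adapter start as an identity map so that the shared core model is initially unaffected. The abstract does not specify these details.

```python
import random

# Hypothetical arrangement labels: which cell of a 2x2 canvas holds the query.
ARRANGEMENTS = ("top_left", "top_right", "bottom_left", "bottom_right")


class TinyMLP:
    """Two-layer residual MLP: x + W2 * relu(W1 * x + b1) + b2."""

    def __init__(self, dim, hidden, rng):
        self.w1 = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        # Assumption: zero-initialized output layer, so the adapter starts
        # as an identity map and does not disturb the core model's features.
        self.w2 = [[0.0] * hidden for _ in range(dim)]
        self.b2 = [0.0] * dim

    def __call__(self, x):
        hidden = [
            max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(self.w1, self.b1)
        ]
        delta = [
            sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(self.w2, self.b2)
        ]
        return [xi + di for xi, di in zip(x, delta)]


class ArrangementAdapters:
    """Holds one lightweight adapter per arrangement; the core model is shared."""

    def __init__(self, dim, hidden=4, seed=0):
        rng = random.Random(seed)
        self.adapters = {name: TinyMLP(dim, hidden, rng) for name in ARRANGEMENTS}

    def __call__(self, features, arrangement):
        return self.adapters[arrangement](features)


if __name__ == "__main__":
    adapters = ArrangementAdapters(dim=3)
    features = [1.0, -2.0, 0.5]
    for name in ARRANGEMENTS:
        print(name, adapters(features, name))
```

Because each arrangement has its own parameters, updating one adapter leaves the others and the shared backbone unchanged, which is one way to read "decouple layout priors from the core model".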
Problem

Research questions and friction points this paper is trying to address.

Vision In-Context Learning
Prompt Fusion
Prompt Arrangement
Inpainting
Contextual Prompts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision In-Context Learning
Prompt Fusion
Arrangement-Specific MLPs
Bidirectional Fine-Tuning
Cross-Task Generalization
Wenwen Liao
College of Intelligent Robotics and Advanced Manufacturing, Fudan University
Jianbo Yu
Professor of School of Mechanical Engineering, Tongji University
Prognostics and Health Management, Condition-Based Monitoring, Quality Control, Fault Diagnosis, Industrial Engineering
Yuansong Wang
Tsinghua Shenzhen International Graduate School, Tsinghua University
Shifu Yan
ByteDance Ltd.
Xiaofeng Yang
School of Microelectronics, Fudan University