🤖 AI Summary
This work addresses the difficulty of replicating in-context learning (ICL) in vision, where task heterogeneity has hindered a single unified model. The approach formulates visual ICL as a conditional generation task grounded in visual analogy, adapting a frozen Diffusion Transformer (DiT) with a role-aware multi-image conditioning mechanism. To mitigate gradient interference across diverse tasks, the method fine-tunes with a mixture-of-experts LoRA strategy. The authors also curate a large-scale visual in-context learning dataset spanning perception, restoration, and editing tasks. Experimental results show that the proposed framework outperforms existing methods across a variety of visual tasks, validating the efficacy of a unified ICL paradigm, particularly in open-domain image editing scenarios.
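The paper does not detail the mixture-of-experts LoRA design; the following is a minimal PyTorch sketch of the general technique, assuming token-level soft routing over per-expert low-rank adapters. The expert count, rank, scaling, and router are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical MoE-LoRA layer (illustrative assumptions, not the paper's code):
# each expert is a low-rank (A, B) pair around a frozen base linear layer, and
# a small router softly mixes expert updates per token, so gradients from
# heterogeneous tasks can concentrate in different experts.
import torch
import torch.nn as nn


class MoELoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, num_experts: int = 4,
                 rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # keep the backbone frozen
            p.requires_grad_(False)
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(num_experts, d_in, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_experts, rank, d_out))  # zero-init: no update at start
        self.router = nn.Linear(d_in, num_experts)
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, d_in)
        gates = torch.softmax(self.router(x), dim=-1)                 # (b, t, E)
        delta = torch.einsum("btd,edr,erk->btek", x, self.A, self.B)  # per-expert low-rank update
        update = torch.einsum("bte,btek->btk", gates, delta)          # router-weighted mixture
        return self.base(x) + self.scale * update


layer = MoELoRALinear(nn.Linear(64, 64))
print(layer(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```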
📝 Abstract
Replicating In-Context Learning (ICL) in computer vision remains challenging due to task heterogeneity. We propose **VIRAL**, a framework that elicits visual reasoning from a pre-trained image editing model by formulating ICL as conditional generation via visual analogy ($x_s : x_t :: x_q : y_q$). We adapt a frozen Diffusion Transformer (DiT) using role-aware multi-image conditioning and introduce a Mixture-of-Experts LoRA to mitigate gradient interference across diverse tasks. Additionally, to bridge the gaps in current visual context datasets, we curate a large-scale dataset spanning perception, restoration, and editing. Experiments demonstrate that VIRAL outperforms existing methods, validating that a unified V-ICL paradigm can handle the majority of visual tasks, including open-domain editing. Our code is available at https://anonymous.4open.science/r/VIRAL-744A
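One way to read the analogy $x_s : x_t :: x_q : y_q$ is as conditional generation of the target $y_q$ from the source/target exemplar pair and the query. A minimal formalization under standard diffusion notation follows; the loss form and conditioning interface are assumptions for illustration, not taken from the abstract.

```latex
% Assumed formalization: sample y_q conditioned on the analogy triplet,
% and train the DiT with the usual denoising objective on the target only.
\[
  y_q \sim p_\theta\!\left(y \mid x_s, x_t, x_q\right),
  \qquad
  \mathcal{L} = \mathbb{E}_{t,\,\epsilon}
  \left\| \epsilon - \epsilon_\theta\!\left(y_q^{(t)},\, t,\, x_s, x_t, x_q\right) \right\|_2^2,
\]
% where y_q^{(t)} is the noised target at diffusion timestep t and the
% role-aware conditioning distinguishes x_s, x_t, and x_q by their roles.
```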