🤖 AI Summary
This work addresses the difficulty of replicating in-context learning (ICL) in vision, where task heterogeneity has hindered a single unified model. The approach formulates visual ICL as a conditional generation task grounded in visual analogy, adapting a frozen Diffusion Transformer (DiT) with a role-aware multi-image conditioning mechanism. To mitigate gradient interference across diverse tasks, the method fine-tunes with a mixture-of-experts LoRA strategy. The authors also curate a large-scale visual in-context learning dataset spanning perception, restoration, and editing tasks. Experimental results show that the proposed framework outperforms existing methods across a variety of visual tasks, validating the efficacy of a unified ICL paradigm, particularly in open-domain image editing scenarios.
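The paper does not detail the mixture-of-experts LoRA design; the following is a minimal PyTorch sketch of the general technique, assuming token-level soft routing over per-expert low-rank adapters. The expert count, rank, scaling, and router are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical MoE-LoRA layer (illustrative assumptions, not the paper's code):
# each expert is a low-rank (A, B) pair around a frozen base linear layer, and
# a small router softly mixes expert updates per token, so gradients from
# heterogeneous tasks can concentrate in different experts.
import torch
import torch.nn as nn


class MoELoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, num_experts: int = 4,
                 rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # keep the backbone frozen
            p.requires_grad_(False)
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(num_experts, d_in, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_experts, rank, d_out))  # zero-init: no update at start
        self.router = nn.Linear(d_in, num_experts)
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, d_in)
        gates = torch.softmax(self.router(x), dim=-1)                 # (b, t, E)
        delta = torch.einsum("btd,edr,erk->btek", x, self.A, self.B)  # per-expert low-rank update
        update = torch.einsum("bte,btek->btk", gates, delta)          # router-weighted mixture
        return self.base(x) + self.scale * update


layer = MoELoRALinear(nn.Linear(64, 64))
print(layer(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```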
📝 Abstract
Replicating In-Context Learning (ICL) in computer vision remains challenging due to task heterogeneity. We propose **VIRAL**, a framework that elicits visual reasoning from a pre-trained image editing model by formulating ICL as conditional generation via visual analogy ($x_s : x_t :: x_q : y_q$). We adapt a frozen Diffusion Transformer (DiT) using role-aware multi-image conditioning and introduce a Mixture-of-Experts LoRA to mitigate gradient interference across diverse tasks. Additionally, to bridge the gaps in current visual context datasets, we curate a large-scale dataset spanning perception, restoration, and editing. Experiments demonstrate that VIRAL outperforms existing methods, validating that a unified V-ICL paradigm can handle the majority of visual tasks, including open-domain editing. Our code is available at https://anonymous.4open.science/r/VIRAL-744A
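One way to read the analogy $x_s : x_t :: x_q : y_q$ is as conditional generation of the target $y_q$ from the source/target exemplar pair and the query. A minimal formalization under standard diffusion notation follows; the loss form and conditioning interface are assumptions for illustration, not taken from the abstract.

```latex
% Assumed formalization: sample y_q conditioned on the analogy triplet,
% and train the DiT with the usual denoising objective on the target only.
\[
  y_q \sim p_\theta\!\left(y \mid x_s, x_t, x_q\right),
  \qquad
  \mathcal{L} = \mathbb{E}_{t,\,\epsilon}
  \left\| \epsilon - \epsilon_\theta\!\left(y_q^{(t)},\, t,\, x_s, x_t, x_q\right) \right\|_2^2,
\]
% where y_q^{(t)} is the noised target at diffusion timestep t and the
% role-aware conditioning distinguishes x_s, x_t, and x_q by their roles.
```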