In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer

📅 2025-04-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing instruction-driven image editing methods struggle to balance precision and efficiency: fine-tuning demands substantial data and computational resources, while training-free approaches suffer from limited semantic understanding and suboptimal editing quality. This paper proposes a zero-shot, instruction-driven editing framework that requires neither architectural modifications nor annotated data, instead leveraging in-context prompting to unlock the intrinsic semantic reasoning capabilities of diffusion Transformers (DiTs). The key contributions are threefold: (1) the first in-context editing paradigm for diffusion models; (2) a LoRA-MoE hybrid fine-tuning strategy enabling dynamic expert routing and parameter-efficient optimization; and (3) a vision-language model (VLM)-guided early-noise filtering mechanism to enhance editing fidelity. Experiments demonstrate that the method surpasses state-of-the-art approaches using only 0.5% of the training data and 1% of the trainable parameters, achieving significant gains in both editing accuracy and inference efficiency.
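The VLM-guided early-noise filtering described above can be read as a best-of-N search over initial noise seeds: sample several candidates, run only a few cheap denoising steps on each, and keep the seed whose coarse preview a VLM judge scores highest for instruction compliance. A minimal sketch under that reading; `score_fn` and `denoise_preview` are hypothetical stand-ins for the VLM judge and a truncated denoising pass, not the paper's actual interfaces:

```python
import torch

def select_initial_noise(
    score_fn,          # hypothetical: preview tensor -> instruction-alignment score
    denoise_preview,   # hypothetical: noise -> coarse image after a few early steps
    num_candidates=8,
    shape=(1, 4, 64, 64),
    generator=None,
):
    """Best-of-N filter over starting noises.

    Samples `num_candidates` Gaussian noises, previews each with a cheap
    truncated denoising pass, and returns the noise whose preview the
    judge scores highest, along with that score.
    """
    best_noise, best_score = None, float("-inf")
    for _ in range(num_candidates):
        noise = torch.randn(shape, generator=generator)
        preview = denoise_preview(noise)   # only early steps, so this stays cheap
        score = score_fn(preview)          # e.g. a VLM rating edit-instruction match
        if score > best_score:
            best_noise, best_score = noise, score
    return best_noise, best_score
```

The full sampler then continues denoising only from the winning seed, which is how an inference-time scaling method can improve quality without touching the model's weights.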

📝 Abstract
Instruction-based image editing enables robust image modification via natural language prompts, yet current methods face a precision-efficiency tradeoff. Fine-tuning methods demand significant computational resources and large datasets, while training-free techniques struggle with instruction comprehension and edit quality. We resolve this dilemma by leveraging the enhanced generation capacity and native contextual awareness of large-scale Diffusion Transformers (DiTs). Our solution introduces three contributions: (1) an in-context editing framework for zero-shot instruction compliance using in-context prompting, avoiding structural changes; (2) a LoRA-MoE hybrid tuning strategy that enhances flexibility with efficient adaptation and dynamic expert routing, without extensive retraining; and (3) an early filter inference-time scaling method using vision-language models (VLMs) to select better initial noise early, improving edit quality. Extensive evaluations demonstrate our method's superiority: it outperforms state-of-the-art approaches while requiring only 0.5% training data and 1% trainable parameters compared to conventional baselines. This work establishes a new paradigm that enables high-precision yet efficient instruction-guided editing. Codes and demos can be found at https://river-zhang.github.io/ICEdit-gh-pages/.
Problem

Research questions and friction points this paper is trying to address.

Resolves precision-efficiency tradeoff in instruction-based image editing
Enables zero-shot instruction compliance without structural changes
Improves edit quality with efficient adaptation and dynamic expert routing
Innovation

Methods, ideas, or system contributions that make the work stand out.

In-context editing framework for zero-shot instruction compliance
LoRA-MoE hybrid tuning strategy for flexible adaptation
Early filter inference-time scaling with VLMs
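The paper does not spell out the LoRA-MoE layer here, but the usual shape of such a hybrid is a frozen base projection augmented by several low-rank (LoRA) experts, with a small router mixing their updates per token. A minimal sketch of one plausible realization, assuming soft (softmax) routing; the class and parameter names are illustrative, not the authors':

```python
import torch
import torch.nn as nn

class LoRAMoELinear(nn.Module):
    """A frozen linear layer plus a mixture of LoRA experts.

    Each expert contributes a rank-r update B_e @ A_e; a token-wise
    router produces softmax gates that mix the experts' outputs.
    Only the experts and the router are trainable, which is what keeps
    the trainable-parameter count tiny relative to the base model.
    """

    def __init__(self, in_features, out_features, num_experts=4, rank=8):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)
        self.base.bias.requires_grad_(False)
        # A initialized small, B at zero, so training starts from the base output.
        self.lora_A = nn.Parameter(torch.randn(num_experts, rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(num_experts, out_features, rank))
        self.router = nn.Linear(in_features, num_experts)

    def forward(self, x):
        # x: (batch, seq, in_features)
        gates = torch.softmax(self.router(x), dim=-1)            # (b, s, E)
        down = torch.einsum("bsd,erd->bser", x, self.lora_A)     # project down per expert
        up = torch.einsum("bser,eor->bseo", down, self.lora_B)   # project back up
        delta = torch.einsum("bse,bseo->bso", gates, up)         # gate-weighted mix
        return self.base(x) + delta
```

Dynamic routing lets different experts specialize (e.g. on different edit types) while the frozen backbone preserves the DiT's generative prior; a sparse top-k router would be a drop-in variant of the softmax gating shown here.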
Zechuan Zhang
PhD student in Zhejiang University
3D Vision · Image Generation · AI4Sci
Ji Xie
Research Intern, UC Berkeley
Computer Vision · Image Generation · Multi-Modal
Yu Lu
ReLER, CCAI, Zhejiang University
Zongxin Yang
DBMI, HMS, Harvard University
Yi Yang
ReLER, CCAI, Zhejiang University