🤖 AI Summary
Diffusion models struggle with semantic distortion and spatial misalignment in instruction-driven image editing—especially for large-scale, structurally inconsistent layout modifications. To address this, we propose the “image editing as programming” paradigm, decomposing complex edits into sequences of programmable atomic operations. We design a vision-language model (VLM)-driven lightweight adapter scheduling framework that operates atop a Diffusion Transformer (DiT) backbone, enabling modular, composable, and geometrically robust multi-step editing. Our key contribution is the first explicit formulation of image editing as a programmatic process, where a VLM dynamically orchestrates task-specific adapters to jointly preserve semantic consistency and geometric precision. Evaluated on multiple standard benchmarks, our method significantly outperforms state-of-the-art approaches, achieving substantial gains in both semantic fidelity and spatial accuracy for complex, multi-step editing tasks.
📝 Abstract
While diffusion models have achieved remarkable success in text-to-image generation, they encounter significant challenges with instruction-driven image editing. Our research highlights a key challenge: these models particularly struggle with structurally inconsistent edits that involve substantial layout changes. To bridge this gap, we introduce Image Editing As Programs (IEAP), a unified image editing framework built upon the Diffusion Transformer (DiT) architecture. At its core, IEAP approaches instructional editing through a reductionist lens, decomposing complex editing instructions into sequences of atomic operations. Each operation is implemented by a lightweight adapter that shares the same DiT backbone and is specialized for a specific type of edit. Programmed by a vision-language model (VLM)-based agent, these operations collaboratively support arbitrary and structurally inconsistent transformations. By modularizing and sequencing edits in this way, IEAP generalizes robustly across a wide range of editing tasks, from simple adjustments to substantial structural changes. Extensive experiments demonstrate that IEAP significantly outperforms state-of-the-art methods on standard benchmarks across various editing scenarios. In these evaluations, our framework delivers superior accuracy and semantic fidelity, particularly for complex, multi-step instructions. Code is available at https://github.com/YujiaHu1109/IEAP.
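The "editing as programs" control flow described above can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's actual API: the keyword-based `plan` function stands in for the VLM agent, and the `ADAPTERS` table stands in for the lightweight adapters that would each wrap the shared DiT backbone.

```python
# Hypothetical sketch of the IEAP-style "image editing as programming" loop:
# a planner decomposes an instruction into atomic operations, each dispatched
# to a specialized adapter. All names here are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class AtomicOp:
    name: str    # atomic operation type, e.g. "remove", "add", "move"
    target: str  # object the operation acts on


def plan(instruction: str) -> list[AtomicOp]:
    """Toy stand-in for the VLM agent: map instruction clauses to atomic ops."""
    ops = []
    for clause in instruction.lower().split(" and "):
        for verb in ("remove", "add", "move"):
            if clause.startswith(verb):
                ops.append(AtomicOp(verb, clause[len(verb):].strip()))
                break
    return ops


# In the real system each adapter would condition the shared DiT backbone on
# its operation type; here they are placeholders that record the intended edit.
ADAPTERS = {
    "remove": lambda image, op: image + [f"removed {op.target}"],
    "add":    lambda image, op: image + [f"added {op.target}"],
    "move":   lambda image, op: image + [f"moved {op.target}"],
}


def edit(image: list[str], instruction: str) -> list[str]:
    """Execute the programmed sequence of atomic edits in order."""
    for op in plan(instruction):
        image = ADAPTERS[op.name](image, op)
    return image


print(edit([], "remove the car and add a tree"))
# → ['removed the car', 'added a tree']
```

The key design choice this illustrates is that multi-step, structure-changing edits become a composition of simple, well-specified operations, which is what lets a single agent orchestrate them reliably.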