🤖 AI Summary
Diffusion models struggle with semantic distortion and spatial misalignment in instruction-driven image editing—especially for large-scale, structurally inconsistent layout modifications. To address this, we propose the “image editing as programming” paradigm, decomposing complex edits into sequences of programmable atomic operations. We design a vision-language model (VLM)-driven lightweight adapter scheduling framework that operates atop a Diffusion Transformer (DiT) backbone, enabling modular, composable, and geometrically robust multi-step editing. Our key contribution is the first explicit formulation of image editing as a programmatic process, where a VLM dynamically orchestrates task-specific adapters to jointly preserve semantic consistency and geometric precision. Evaluated on multiple standard benchmarks, our method significantly outperforms state-of-the-art approaches, achieving substantial gains in both semantic fidelity and spatial accuracy for complex, multi-step editing tasks.
📝 Abstract
While diffusion models have achieved remarkable success in text-to-image generation, they encounter significant challenges with instruction-driven image editing. Our research highlights a key challenge: these models particularly struggle with structurally inconsistent edits that involve substantial layout changes. To bridge this gap, we introduce Image Editing As Programs (IEAP), a unified image editing framework built upon the Diffusion Transformer (DiT) architecture. At its core, IEAP approaches instructional editing through a reductionist lens, decomposing complex editing instructions into sequences of atomic operations. Each operation is implemented by a lightweight adapter that shares the same DiT backbone and is specialized for a specific type of edit. Programmed by a vision-language model (VLM)-based agent, these operations collaboratively support arbitrary and structurally inconsistent transformations. By modularizing and sequencing edits in this way, IEAP generalizes robustly across a wide range of editing tasks, from simple adjustments to substantial structural changes. Extensive experiments demonstrate that IEAP significantly outperforms state-of-the-art methods on standard benchmarks across various editing scenarios. In these evaluations, our framework delivers superior accuracy and semantic fidelity, particularly for complex, multi-step instructions. Code is available at https://github.com/YujiaHu1109/IEAP.
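The "editing as programs" control flow described above can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's actual API: the keyword-based `plan` function stands in for the VLM agent, and the `ADAPTERS` table stands in for the lightweight adapters that would each wrap the shared DiT backbone.

```python
# Hypothetical sketch of the IEAP-style "image editing as programming" loop:
# a planner decomposes an instruction into atomic operations, each dispatched
# to a specialized adapter. All names here are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class AtomicOp:
    name: str    # atomic operation type, e.g. "remove", "add", "move"
    target: str  # object the operation acts on


def plan(instruction: str) -> list[AtomicOp]:
    """Toy stand-in for the VLM agent: map instruction clauses to atomic ops."""
    ops = []
    for clause in instruction.lower().split(" and "):
        for verb in ("remove", "add", "move"):
            if clause.startswith(verb):
                ops.append(AtomicOp(verb, clause[len(verb):].strip()))
                break
    return ops


# In the real system each adapter would condition the shared DiT backbone on
# its operation type; here they are placeholders that record the intended edit.
ADAPTERS = {
    "remove": lambda image, op: image + [f"removed {op.target}"],
    "add":    lambda image, op: image + [f"added {op.target}"],
    "move":   lambda image, op: image + [f"moved {op.target}"],
}


def edit(image: list[str], instruction: str) -> list[str]:
    """Execute the programmed sequence of atomic edits in order."""
    for op in plan(instruction):
        image = ADAPTERS[op.name](image, op)
    return image


print(edit([], "remove the car and add a tree"))
# → ['removed the car', 'added a tree']
```

The key design choice this illustrates is that multi-step, structure-changing edits become a composition of simple, well-specified operations, which is what lets a single agent orchestrate them reliably.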