🤖 AI Summary
This work addresses the tendency of existing large diffusion Transformers to propagate editing effects beyond intended regions during local image editing, owing to the absence of explicit spatial localization mechanisms. To resolve this, the authors propose REDEdit, a framework that introduces lightweight region-aware block adapters and a SpatialGate routing mechanism atop a frozen DiT backbone, enabling structured injection of positional conditioning signals and effective disentanglement of editing semantics from spatial location. Additionally, a jointly trained MaskPredictor head enables, for the first time, end-to-end mask-free local editing by localizing target regions directly, without requiring user-provided masks. A novel Region-Aware Loss further enhances spatial precision. Experiments demonstrate that REDEdit achieves state-of-the-art performance on both the MagicBrush and Emu-Edit Test benchmarks, outperforming existing methods that either rely on ground-truth masks or operate without any mask guidance.
📝 Abstract
Large diffusion transformers (DiTs) follow global editing instructions well but consistently leak local edits into unrelated regions, because joint-attention architectures offer no explicit channel telling the network where to apply the edit. We introduce REDEdit, a co-trained, instruction- and region-aware adapter framework that retrofits a frozen DiT into a precise local editor without modifying its backbone weights. A lightweight Block Adapter at every transformer block injects a structured condition stream that factorizes what to edit (instruction semantics) from where to edit (spatial mask); a learned SpatialGate routes the adapter signal selectively into the edit region while keeping the rest of the image near-identical to the source; and a Region-Aware Loss focuses the training objective on the changing pixels. Because these components make the backbone's internal representation mask-aware end-to-end, a thin MaskPredictor head trained jointly with the editor can ground the edit region directly from the instruction and source image, eliminating any user-mask requirement at deployment. We evaluate on two complementary benchmarks: MagicBrush (paired ground-truth targets) to measure pixel-level preservation and edit accuracy, and Emu-Edit Test (no ground-truth images, 9 diverse edit categories) to stress-test instruction following and generalization across edit types. On both, REDEdit achieves state-of-the-art results, simultaneously outperforming mask-free and oracle-mask baselines. A seven-variant ablation cleanly isolates the contribution of each component.
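The abstract does not give formulas for the SpatialGate or the Region-Aware Loss, so the following is only a minimal sketch of the two ideas under stated assumptions: the gate is modeled as a per-pixel blend between source and adapter output, and the loss as a mask-weighted mean squared error. The function names `gated_blend` and `region_weighted_mse`, the flat-list pixel layout, and the default weights are all hypothetical and are not taken from the paper.

```python
def gated_blend(source, adapter_out, gate):
    """Blend per pixel: gate=1 takes the adapter output, gate=0 keeps the source.

    Assumption: a linear blend stands in for the learned SpatialGate.
    """
    return [s + g * (a - s) for s, a, g in zip(source, adapter_out, gate)]


def region_weighted_mse(pred, target, mask, inside_weight=1.0, outside_weight=0.1):
    """Weighted MSE that emphasizes pixels inside the edit mask.

    Assumption: a two-level weighting stands in for the Region-Aware Loss;
    the weight values are illustrative only.
    """
    weights = [inside_weight if m else outside_weight for m in mask]
    weight_sum = sum(weights)
    if weight_sum == 0:
        raise ValueError("weights sum to zero; nothing to average over")
    total = sum(w * (p - t) ** 2 for w, p, t in zip(weights, pred, target))
    return total / weight_sum


# Toy two-pixel example: only the first pixel lies in the edit region.
source = [0.0, 1.0]
adapter_out = [1.0, 0.0]
mask = [1, 0]

edited = gated_blend(source, adapter_out, mask)
loss = region_weighted_mse(edited, source, mask)
```

In this toy case the second pixel is left exactly at its source value because its gate is zero, which mirrors the stated goal of keeping the rest of the image near-identical to the source. In the actual method the gate and mask would be learned and soft rather than fixed binary values.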