Rethinking Where to Edit: Task-Aware Localization for Instruction-Based Image Editing

📅 2026-04-22

🤖 AI Summary

Existing instruction-based image editing methods often modify regions unrelated to the requested edit because they lack an explicit localization mechanism. This work proposes a training-free, task-aware edit localization framework that ties edit mask construction to the type of editing task, such as addition, removal, or replacement. Using attention cues from the source and target image streams of a diffusion transformer, the method builds feature centroids that partition tokens into edit and non-edit regions, then combines the per-stream results into a unified mask according to the task. On EdiVal-Bench, the approach consistently improves consistency in non-edited regions while maintaining strong instruction-following performance. It requires no training and is applied on top of recent editing backbones, including Step1X-Edit and Qwen-Image-Edit.

📝 Abstract

Instruction-based image editing (IIE) aims to modify images according to textual instructions while preserving irrelevant content. Despite recent advances in diffusion transformers, existing methods often suffer from over-editing, introducing unintended changes to regions unrelated to the desired edit. We identify that this limitation arises from the lack of an explicit mechanism for edit localization. In particular, different editing operations (e.g., addition, removal, and replacement) induce distinct spatial patterns, yet current IIE models typically treat localization in a task-agnostic manner. To address this limitation, we propose a training-free, task-aware edit localization framework that exploits the intrinsic source and target image streams within IIE models. For each image stream, we first obtain attention-based edit cues, and then construct feature centroids from these cues to partition tokens into edit and non-edit regions. Based on the observation that optimal localization is inherently task-dependent, we further introduce a unified mask construction strategy that selectively leverages the source and target image streams for different editing tasks. We provide a systematic analysis of our proposed insights and approaches. Extensive experiments on EdiVal-Bench demonstrate that our framework consistently improves non-edit region consistency while maintaining strong instruction-following performance on top of powerful recent image editing backbones, including Step1X-Edit and Qwen-Image-Edit.
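
The two steps described in the abstract (centroid-based token partitioning per stream, then task-dependent mask fusion) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `partition_tokens` and `unified_mask`, the `top_frac` parameter, the use of cosine similarity, and the specific task-to-stream mapping (target stream for addition, source stream for removal, union for replacement) are all assumptions made for illustration, since the abstract does not specify these details.

```python
import numpy as np


def partition_tokens(features, cue, top_frac=0.1):
    """Split tokens into edit (True) / non-edit (False) for one image stream.

    features: (N, D) token features; cue: (N,) attention-based edit scores.
    Hypothetical procedure: seed two centroids from the highest- and
    lowest-scoring tokens, then assign each token to the more similar one.
    """
    n = features.shape[0]
    k = max(1, int(round(top_frac * n)))
    order = np.argsort(cue)                      # ascending by cue score
    low_idx, high_idx = order[:k], order[-k:]

    # L2-normalize so the dot product below is a cosine similarity.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / (norms + 1e-8)

    edit_centroid = unit[high_idx].mean(axis=0)      # tokens the cue marks as edited
    keep_centroid = unit[low_idx].mean(axis=0)       # tokens the cue marks as untouched
    return unit @ edit_centroid > unit @ keep_centroid


def unified_mask(task, source_mask, target_mask):
    """Fuse per-stream masks by task type (assumed mapping, not from the paper)."""
    if task == "addition":       # new content exists only in the target image
        return target_mask.copy()
    if task == "removal":        # removed content exists only in the source image
        return source_mask.copy()
    if task == "replacement":    # both old and new content matter
        return source_mask | target_mask
    raise ValueError(f"unknown task: {task}")


# Toy example: 6 tokens in 2-D, the first three resemble the edited object.
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
                  [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]])
cue = np.array([0.9, 0.8, 0.7, 0.1, 0.2, 0.05])
src_mask = partition_tokens(feats, cue, top_frac=0.34)
tgt_mask = np.array([False, False, True, True, False, False])
final_mask = unified_mask("replacement", src_mask, tgt_mask)
```

In this toy setup the source-stream mask marks the first three tokens as editable, and the replacement rule takes the union with the target-stream mask. A real system would take `features` and `cue` from the backbone's internal activations and attention maps, which this sketch does not model.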

Problem

Research questions and friction points this paper is trying to address.

instruction-based image editing

over-editing

edit localization

task-aware

diffusion transformers

Innovation

Methods, ideas, or system contributions that make the work stand out.

task-aware localization

instruction-based image editing

edit localization