🤖 AI Summary
Existing flow-matching-based image editors struggle in multi-instance scenarios, where several regions of an image must be edited simultaneously and independently without interfering with one another. The authors attribute this limitation to globally conditioned velocity fields and joint attention mechanisms, which entangle concurrent edits. They propose Instance-Disentangled Attention, a mechanism that partitions joint attention so that each textual instruction is bound to its corresponding spatial region during velocity field estimation, enabling instance-level editing in a single pass. The method is evaluated on natural images and on a newly introduced benchmark of text-dense infographics with region-level editing instructions, where it is reported to improve edit disentanglement and locality while preserving global coherence.
📝 Abstract
Flow matching models have recently emerged as an efficient alternative to diffusion, especially for text-guided image generation and editing, offering faster inference through continuous-time dynamics. However, existing flow-based editors predominantly support global or single-instruction edits and struggle with multi-instance scenarios, where multiple parts of a reference input must be edited independently without semantic interference. We identify this limitation as a consequence of globally conditioned velocity fields and joint attention mechanisms, which entangle concurrent edits. To address this issue, we introduce Instance-Disentangled Attention, a mechanism that partitions joint attention operations, enforcing binding between instance-specific textual instructions and spatial regions during velocity field estimation. We evaluate our approach on both natural image editing and a newly introduced benchmark of text-dense infographics with region-level editing instructions. Experimental results demonstrate that our approach promotes edit disentanglement and locality while preserving global output coherence, enabling single-pass, instance-level editing.
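To make the idea of partitioning joint attention more concrete, here is a minimal NumPy sketch of one way such a binding could be expressed as an attention mask. It is an illustration under stated assumptions, not the authors' implementation: the token layout (all instruction tokens first, then image tokens), the `region_ids` labelling, and the function names `build_instance_mask` and `masked_attention` are all hypothetical. The choice to let image tokens attend to every other image token is likewise only a guess at how global coherence might be kept.

```python
import numpy as np


def build_instance_mask(text_lens, region_ids):
    """Boolean mask (True = may attend) over a joint [text | image] token sequence.

    Assumed layout (hypothetical, not taken from the paper):
      text_lens  -- token count of each instance's instruction, in order
      region_ids -- one int per image token; k >= 0 means the token lies in
                    the region of instance k, -1 means background
    """
    region_ids = np.asarray(region_ids)
    n_text = sum(text_lens)
    n = n_text + region_ids.shape[0]
    mask = np.zeros((n, n), dtype=bool)

    # Assumption: image tokens see all image tokens, so the output stays coherent.
    mask[n_text:, n_text:] = True

    start = 0
    for k, length in enumerate(text_lens):
        txt = slice(start, start + length)
        img = n_text + np.flatnonzero(region_ids == k)
        mask[txt, txt] = True  # an instruction attends to its own tokens
        mask[txt, img] = True  # ... and to the image tokens of its region
        mask[img, txt] = True  # region tokens attend back to their instruction
        start += length
    return mask


def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention restricted by a boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

With two instructions of lengths 2 and 3 and four image tokens labelled `[0, 0, 1, -1]`, the tokens of the first instruction receive zero attention weight on the second instruction and on the second region. That is the kind of isolation between concurrent edits the abstract describes. In a real flow-matching editor, a mask of this kind would be applied inside the attention layers at each velocity field evaluation.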