🤖 AI Summary
Semantic image editing faces two key bottlenecks: inversion-based methods suffer from reconstruction errors, while instruction-based approaches are constrained by the quality and scale of annotated instruction data. This paper introduces DescriptiveEdit, a framework that formulates image editing as reference-image-guided text-to-image generation, which removes the need for both inversion and task-specific instruction datasets. Its core component is the Cross-Attentive UNet, which adds attention bridges that inject reference image features into the text-conditioned generation process, without modifying the pre-trained model's architecture. Because the approach remains a text-to-image model, it integrates with extension modules such as ControlNet and IP-Adapter. On the Emu Edit benchmark, DescriptiveEdit improves editing accuracy and consistency.
📝 Abstract
Despite progress in text-to-image generation, semantic image editing remains a challenge. Inversion-based algorithms unavoidably introduce reconstruction errors, while instruction-based models mainly suffer from limited dataset quality and scale. To address these problems, we propose a descriptive-prompt-based editing framework, named DescriptiveEdit. The core idea is to re-frame "instruction-based image editing" as "reference-image-based text-to-image generation", which preserves the generative power of well-trained text-to-image models without architectural modifications or inversion. Specifically, taking the reference image and a prompt as input, we introduce a Cross-Attentive UNet, which adds attention bridges to inject reference image features into the prompt-to-edit-image generation process. Owing to its text-to-image nature, DescriptiveEdit overcomes limitations in instruction dataset quality, integrates seamlessly with ControlNet, IP-Adapter, and other extensions, and is more scalable. Experiments on the Emu Edit benchmark show that it improves editing accuracy and consistency.
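The attention-bridge idea can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `attention_bridge`, the projection matrices `w_q`, `w_k`, `w_v`, and the scalar `gate` are assumptions introduced here for illustration. Queries come from the features of the image being generated, keys and values come from reference-image features, and the attended result is added back as a residual, so a gate of zero leaves the original features untouched.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_bridge(edit_feats, ref_feats, w_q, w_k, w_v, gate=1.0):
    """Hypothetical bridge: edit tokens attend to reference tokens.

    edit_feats: (n_edit, d) features of the image being generated.
    ref_feats:  (n_ref, d) features extracted from the reference image.
    w_q, w_k, w_v: (d, d) projections (assumed, not taken from the paper).
    gate: assumed scalar controlling how strongly the reference is injected.
    """
    q = matmul(edit_feats, w_q)   # queries from the edit branch
    k = matmul(ref_feats, w_k)    # keys from the reference branch
    v = matmul(ref_feats, w_v)    # values from the reference branch
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))
    weights = [softmax([s * scale for s in row]) for row in scores]
    attended = matmul(weights, v)
    # Residual injection: original features plus gated reference information.
    return [
        [h + gate * a for h, a in zip(h_row, a_row)]
        for h_row, a_row in zip(edit_feats, attended)
    ]


identity = [[1.0, 0.0], [0.0, 1.0]]
edit = [[1.0, 0.0], [0.0, 1.0]]              # two edit tokens
ref = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three reference tokens

unchanged = attention_bridge(edit, ref, identity, identity, identity, gate=0.0)
injected = attention_bridge(edit, ref, identity, identity, identity, gate=1.0)
```

With `gate=0.0` the output equals the input features, which mirrors the claim that the pre-trained model's behaviour is preserved; with a non-zero gate each edit token receives a weighted mix of reference features. A real model would apply such a step inside the UNet's attention layers on learned, high-dimensional features.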