🤖 AI Summary
Text-guided image editing suffers from inaccurate semantic localization and low editing fidelity. To address these challenges, we propose a novel paradigm featuring precise semantic localization and dual-level conditional control. First, we design a vision-text self-attention-enhanced cross-attention map localization method to achieve fine-grained regional semantic alignment. Second, we introduce a synergistic dual-level conditioning mechanism—operating jointly at the feature and latent levels—to inject region-specific prompts consistently. Third, we construct RW-800, the first high-resolution benchmark tailored for real-world scenarios, comprising 800 high-quality images. Implemented on the DiT architecture, our method achieves significant improvements on PIE-Bench and RW-800: +12.6% in local editing accuracy and +9.3% in background structural preservation, demonstrating superior fine-grained controllability and high-fidelity reconstruction capability.
📝 Abstract
This paper presents a novel approach to improving text-guided image editing using diffusion-based models. Text-guided image editing task poses key challenge of precisly locate and edit the target semantic, and previous methods fall shorts in this aspect. Our method introduces a Precise Semantic Localization strategy that leverages visual and textual self-attention to enhance the cross-attention map, which can serve as a regional cues to improve editing performance. Then we propose a Dual-Level Control mechanism for incorporating regional cues at both feature and latent levels, offering fine-grained control for more precise edits. To fully compare our methods with other DiT-based approaches, we construct the RW-800 benchmark, featuring high resolution images, long descriptive texts, real-world images, and a new text editing task. Experimental results on the popular PIE-Bench and RW-800 benchmarks demonstrate the superior performance of our approach in preserving background and providing accurate edits.