DCEdit: Dual-Level Controlled Image Editing via Precisely Localized Semantics

📅 2025-03-21

📈 Citations: 0

✨ Influential: 0

career value

203K/year

🤖 AI Summary

Text-guided image editing suffers from inaccurate semantic localization and low editing fidelity. To address these challenges, we propose a novel paradigm featuring precise semantic localization and dual-level conditional control. First, we design a vision-text self-attention-enhanced cross-attention map localization method to achieve fine-grained regional semantic alignment. Second, we introduce a synergistic dual-level conditioning mechanism—operating jointly at the feature and latent levels—to inject region-specific prompts consistently. Third, we construct RW-800, the first high-resolution benchmark tailored for real-world scenarios, comprising 800 high-quality images. Implemented on the DiT architecture, our method achieves significant improvements on PIE-Bench and RW-800: +12.6% in local editing accuracy and +9.3% in background structural preservation, demonstrating superior fine-grained controllability and high-fidelity reconstruction capability.

Technology Category

Application Category

📝 Abstract

This paper presents a novel approach to improving text-guided image editing using diffusion-based models. Text-guided image editing task poses key challenge of precisly locate and edit the target semantic, and previous methods fall shorts in this aspect. Our method introduces a Precise Semantic Localization strategy that leverages visual and textual self-attention to enhance the cross-attention map, which can serve as a regional cues to improve editing performance. Then we propose a Dual-Level Control mechanism for incorporating regional cues at both feature and latent levels, offering fine-grained control for more precise edits. To fully compare our methods with other DiT-based approaches, we construct the RW-800 benchmark, featuring high resolution images, long descriptive texts, real-world images, and a new text editing task. Experimental results on the popular PIE-Bench and RW-800 benchmarks demonstrate the superior performance of our approach in preserving background and providing accurate edits.

Problem

Research questions and friction points this paper is trying to address.

Precisely localize and edit target semantics in images

Enhance cross-attention for better regional editing cues

Achieve fine-grained control at feature and latent levels

Innovation

Methods, ideas, or system contributions that make the work stand out.

Precise Semantic Localization via self-attention

Dual-Level Control at feature and latent levels

RW-800 benchmark for comprehensive evaluation

🔎 Similar Papers

An Item is Worth a Prompt: Versatile Image Editing with Disentangled Control