DCEdit: Dual-Level Controlled Image Editing via Precisely Localized Semantics

📅 2025-03-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Text-guided image editing suffers from inaccurate semantic localization and low editing fidelity. To address these challenges, we propose a novel paradigm featuring precise semantic localization and dual-level conditional control. First, we design a vision-text self-attention-enhanced cross-attention map localization method to achieve fine-grained regional semantic alignment. Second, we introduce a synergistic dual-level conditioning mechanism—operating jointly at the feature and latent levels—to inject region-specific prompts consistently. Third, we construct RW-800, the first high-resolution benchmark tailored for real-world scenarios, comprising 800 high-quality images. Implemented on the DiT architecture, our method achieves significant improvements on PIE-Bench and RW-800: +12.6% in local editing accuracy and +9.3% in background structural preservation, demonstrating superior fine-grained controllability and high-fidelity reconstruction capability.

Technology Category

Application Category

📝 Abstract
This paper presents a novel approach to improving text-guided image editing using diffusion-based models. Text-guided image editing task poses key challenge of precisly locate and edit the target semantic, and previous methods fall shorts in this aspect. Our method introduces a Precise Semantic Localization strategy that leverages visual and textual self-attention to enhance the cross-attention map, which can serve as a regional cues to improve editing performance. Then we propose a Dual-Level Control mechanism for incorporating regional cues at both feature and latent levels, offering fine-grained control for more precise edits. To fully compare our methods with other DiT-based approaches, we construct the RW-800 benchmark, featuring high resolution images, long descriptive texts, real-world images, and a new text editing task. Experimental results on the popular PIE-Bench and RW-800 benchmarks demonstrate the superior performance of our approach in preserving background and providing accurate edits.
Problem

Research questions and friction points this paper is trying to address.

Precisely localize and edit target semantics in images
Enhance cross-attention for better regional editing cues
Achieve fine-grained control at feature and latent levels
Innovation

Methods, ideas, or system contributions that make the work stand out.

Precise Semantic Localization via self-attention
Dual-Level Control at feature and latent levels
RW-800 benchmark for comprehensive evaluation
Y
Yihan Hu
Institute of Information Science, Beijing Jiaotong University
J
Jianing Peng
Institute of Information Science, Beijing Jiaotong University
Yiheng Lin
Yiheng Lin
California Institute of Technology
Online Algorithmscontrol
T
Ting Liu
MT Lab, Meitu Inc
X
Xiaochao Qu
MT Lab, Meitu Inc
Luoqi Liu
Luoqi Liu
Director of MT Lab; Meitu
Computer Vision
Y
Yao Zhao
Institute of Information Science, Beijing Jiaotong University
Yunchao Wei
Yunchao Wei
Professor, Beijing Jiaotong University, UTS, UIUC, NUS
Computer VisionMachine Learning