Textual and Visual Prompt Fusion for Image Editing via Step-Wise Alignment

2023-08-30
Citations: 0
Influential: 0
AI Summary
Existing image editing methods relying solely on textual or visual prompts struggle to simultaneously preserve semantic consistency and visual fidelity. This paper proposes a dual-modal editing framework that freezes the semantic latent space of diffusion models. It aligns high-level semantics between text and reference images via semantic latent space mapping and employs a lightweight adapter network for stepwise alignment: first matching semantic distributions, then refining local details. Crucially, the approach avoids fine-tuning the diffusion backbone, introducing only a minimal-parameter adapter network to enable fine-grained, high-fidelity, text-driven editing. Extensive experiments demonstrate significant improvements over state-of-the-art methods across multiple benchmarks, with consistent gains in editing quality, semantic faithfulness, and visual naturalness.
๐Ÿ“ Abstract
The use of denoising diffusion models is becoming increasingly popular in the field of image editing. However, current approaches often rely on either image-guided methods, which provide a visual reference but lack control over semantic consistency, or text-guided methods, which ensure alignment with the text guidance but compromise visual quality. To resolve this issue, we propose a framework that integrates a fusion of generated visual references and text guidance into the semantic latent space of a extit{frozen} pre-trained diffusion model. Using only a tiny neural network, our framework provides control over diverse content and attributes, driven intuitively by the text prompt. Compared to state-of-the-art methods, the framework generates images of higher quality while providing realistic editing effects across various benchmark datasets.
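The general pattern the abstract describes, a frozen backbone whose semantic latent is shifted by a tiny trainable network that sees both a text embedding and a visual-reference embedding, can be illustrated with a toy sketch. This is not the authors' architecture: the dimensions, the names `frozen_latent`, `TinyAdapter`, and `edited_latent`, the concatenation-based fusion, and the zero-initialized output layer are all assumptions made for illustration, and a random linear map stands in for the diffusion model.

```python
import random

random.seed(0)

LATENT_DIM = 16  # hypothetical size of the semantic latent
EMBED_DIM = 8    # hypothetical size of the text and image embeddings
HIDDEN_DIM = 4   # hypothetical adapter bottleneck width


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, scale=0.1):
    return [[random.uniform(-scale, scale) for _ in range(cols)]
            for _ in range(rows)]


# Stand-in for the frozen pre-trained backbone: these weights are never updated.
FROZEN_W = rand_matrix(LATENT_DIM, LATENT_DIM)


def frozen_latent(x):
    """Semantic latent produced by the (toy) frozen backbone."""
    return matvec(FROZEN_W, x)


class TinyAdapter:
    """Small network mapping fused text + image embeddings to a latent shift."""

    def __init__(self):
        self.w_in = rand_matrix(HIDDEN_DIM, 2 * EMBED_DIM)
        # Assumption: zero-initialized output, so the edit starts as a no-op.
        self.w_out = [[0.0] * HIDDEN_DIM for _ in range(LATENT_DIM)]

    def num_params(self):
        return HIDDEN_DIM * 2 * EMBED_DIM + LATENT_DIM * HIDDEN_DIM

    def shift(self, text_emb, image_emb):
        fused = text_emb + image_emb  # fusion by concatenation (assumption)
        hidden = [max(0.0, v) for v in matvec(self.w_in, fused)]  # ReLU
        return matvec(self.w_out, hidden)


def edited_latent(x, text_emb, image_emb, adapter):
    """Frozen latent plus the adapter's shift; only the adapter would be trained."""
    h = frozen_latent(x)
    delta = adapter.shift(text_emb, image_emb)
    return [hi + di for hi, di in zip(h, delta)]
```

In this toy setup the adapter holds 128 parameters against 256 in the stand-in backbone. In a real diffusion model the ratio would be far smaller, which is the sense in which the abstract calls the network "tiny".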
Problem

Research questions and friction points this paper is trying to address.

Image Modification
Denoising Diffusion Models
Incoherent Results
Innovation

Methods, ideas, or system contributions that make the work stand out.

Denoising Diffusion Model
Integrated Image and Text Cues
Pre-trained Model Utilization
Zhanbo Feng
Department of Computer Science and Engineering, Shanghai Jiao Tong University
Zenan Ling
Huazhong University of Science and Technology
random matrix theory, deep learning theory
Xinyu Lu
Department of Computer Science and Engineering, Shanghai Jiao Tong University
Ci Gong
EIC, Huazhong University of Science and Technology
Feng Zhou
Wugedele Bao
School of Computer Science, Hohhot Minzu College
Jie Li
Department of Computer Science and Engineering, Shanghai Jiao Tong University
Fan Yang
Department of Computer Science and Engineering, Shanghai Jiao Tong University
Robert C. Qiu
Professor of Electrical Engineering, Tennessee Technological University
Deep Learning, Big Data, Wireless Communications, Smart Grid