🤖 AI Summary
Existing image editing methods relying solely on textual or visual prompts struggle to simultaneously preserve semantic consistency and visual fidelity. This paper proposes a dual-modal editing framework built on the frozen semantic latent space of a diffusion model. It aligns high-level semantics between text and reference images via semantic latent space mapping and employs a lightweight adapter network for stepwise alignment: first matching semantic distributions, then refining local details. Crucially, the approach avoids fine-tuning the diffusion backbone, introducing only a minimal-parameter adapter network to enable fine-grained, high-fidelity, text-driven editing. Extensive experiments demonstrate significant improvements over state-of-the-art methods across multiple benchmarks, with consistent gains in editing quality, semantic faithfulness, and visual naturalness.
📄 Abstract
The use of denoising diffusion models is becoming increasingly popular in the field of image editing. However, current approaches often rely on either image-guided methods, which provide a visual reference but lack control over semantic consistency, or text-guided methods, which ensure alignment with the text guidance but compromise visual quality. To resolve this issue, we propose a framework that integrates a fusion of generated visual references and text guidance into the semantic latent space of a *frozen* pre-trained diffusion model. Using only a tiny neural network, our framework provides control over diverse content and attributes, driven intuitively by the text prompt. Compared to state-of-the-art methods, the framework generates higher-quality images and more realistic editing effects across various benchmark datasets.
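The core architectural idea, keeping the diffusion backbone frozen and training only a tiny fusion network over text and reference embeddings, can be sketched as follows. This is an illustrative toy only: the paper's actual layer names, dimensions, and fusion design are not given here, so `fuse_adapter`, the low-rank shapes, and the embedding width are all assumptions.

```python
# Minimal sketch (NumPy stand-in for a real diffusion model): a frozen
# backbone transform plus a low-rank trainable adapter that fuses a text
# embedding and a generated visual reference into the semantic latent space.
# All names and dimensions are illustrative assumptions, not the paper's.
import numpy as np

rng = np.random.default_rng(0)

D = 64  # assumed embedding width of the frozen backbone's latent space

# Frozen backbone weights: never updated during editing.
W_frozen = rng.standard_normal((D, D)) / np.sqrt(D)

# Tiny trainable adapter: a single low-rank projection (rank r) that maps
# the concatenated (text, reference) embeddings into the latent space.
r = 4
A = rng.standard_normal((2 * D, r)) / np.sqrt(2 * D)
B = rng.standard_normal((r, D)) / np.sqrt(r)

def fuse_adapter(text_emb: np.ndarray, ref_emb: np.ndarray) -> np.ndarray:
    """Fuse text and visual-reference embeddings via the low-rank adapter."""
    z = np.concatenate([text_emb, ref_emb])  # shape (2D,)
    return z @ A @ B                         # shape (D,)

def edit_latent(latent: np.ndarray, text_emb, ref_emb) -> np.ndarray:
    """Apply the frozen backbone transform and add the adapter's guidance;
    W_frozen is never fine-tuned, so only A and B would receive gradients."""
    return latent @ W_frozen + fuse_adapter(text_emb, ref_emb)

latent = rng.standard_normal(D)
edited = edit_latent(latent, rng.standard_normal(D), rng.standard_normal(D))
print(edited.shape)  # (64,)
```

The parameter-efficiency claim is visible directly: the adapter holds `2*D*r + r*D` weights, far fewer than the frozen backbone, which is why only a "tiny neural network" needs training.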