🤖 AI Summary
Existing image editing methods relying solely on textual or visual prompts struggle to simultaneously preserve semantic consistency and visual fidelity. This paper proposes a dual-modal editing framework built on the frozen semantic latent space of a diffusion model. It aligns high-level semantics between text and reference images via semantic latent space mapping and employs a lightweight adapter network for stepwise alignment: first matching semantic distributions, then refining local details. Crucially, the approach avoids fine-tuning the diffusion backbone, introducing only a minimal-parameter adapter network to enable fine-grained, high-fidelity, text-driven editing. Extensive experiments demonstrate significant improvements over state-of-the-art methods across multiple benchmarks, with consistent gains in editing quality, semantic faithfulness, and visual naturalness.
📄 Abstract
The use of denoising diffusion models is becoming increasingly popular in the field of image editing. However, current approaches often rely on either image-guided methods, which provide a visual reference but lack control over semantic consistency, or text-guided methods, which ensure alignment with the text guidance but compromise visual quality. To resolve this issue, we propose a framework that integrates a fusion of generated visual references and text guidance into the semantic latent space of a *frozen* pre-trained diffusion model. Using only a tiny neural network, our framework provides control over diverse content and attributes, driven intuitively by the text prompt. Compared to state-of-the-art methods, the framework generates higher-quality images and more realistic editing effects across various benchmark datasets.
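The core architectural idea, keeping the diffusion backbone frozen and training only a tiny fusion network over text and reference embeddings, can be sketched as follows. This is an illustrative toy only: the paper's actual layer names, dimensions, and fusion design are not given here, so `fuse_adapter`, the low-rank shapes, and the embedding width are all assumptions.

```python
# Minimal sketch (NumPy stand-in for a real diffusion model): a frozen
# backbone transform plus a low-rank trainable adapter that fuses a text
# embedding and a generated visual reference into the semantic latent space.
# All names and dimensions are illustrative assumptions, not the paper's.
import numpy as np

rng = np.random.default_rng(0)

D = 64  # assumed embedding width of the frozen backbone's latent space

# Frozen backbone weights: never updated during editing.
W_frozen = rng.standard_normal((D, D)) / np.sqrt(D)

# Tiny trainable adapter: a single low-rank projection (rank r) that maps
# the concatenated (text, reference) embeddings into the latent space.
r = 4
A = rng.standard_normal((2 * D, r)) / np.sqrt(2 * D)
B = rng.standard_normal((r, D)) / np.sqrt(r)

def fuse_adapter(text_emb: np.ndarray, ref_emb: np.ndarray) -> np.ndarray:
    """Fuse text and visual-reference embeddings via the low-rank adapter."""
    z = np.concatenate([text_emb, ref_emb])  # shape (2D,)
    return z @ A @ B                         # shape (D,)

def edit_latent(latent: np.ndarray, text_emb, ref_emb) -> np.ndarray:
    """Apply the frozen backbone transform and add the adapter's guidance;
    W_frozen is never fine-tuned, so only A and B would receive gradients."""
    return latent @ W_frozen + fuse_adapter(text_emb, ref_emb)

latent = rng.standard_normal(D)
edited = edit_latent(latent, rng.standard_normal(D), rng.standard_normal(D))
print(edited.shape)  # (64,)
```

The parameter-efficiency claim is visible directly: the adapter holds `2*D*r + r*D` weights, far fewer than the frozen backbone, which is why only a "tiny neural network" needs training.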