🤖 AI Summary
This paper addresses the challenge of jointly preserving structural integrity and enabling unified multimodal guidance (textual and reference-based) in zero-shot image editing. Methodologically: (i) it leverages the diffusion inversion process to extract structural priors from the source image and introduces a timestep-adaptive null-text embedding to mitigate semantic drift; (ii) it proposes a staged latent-space injection strategy, injecting shape priors early and attribute details late in the denoising process; and (iii) it designs a reference-feature-driven cross-attention mechanism to achieve fine-grained semantic alignment. Evaluated on facial expression transfer, texture transformation, and style injection, the method achieves state-of-the-art performance, significantly improving editing diversity, structural fidelity, and cross-task generalization. The authors claim it is the first approach to seamlessly unify textual and reference-based guidance within a zero-shot diffusion framework without fine-tuning.
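The staged latent-space injection in point (ii) can be illustrated with a minimal sketch. The function, thresholds, and blending weights below are hypothetical and not taken from the paper; it only shows the idea of switching the injected signal by denoising progress (shape priors early, attribute details late):

```python
import numpy as np

def staged_injection(edited_latent, source_latent, ref_latent, t, T,
                     shape_cutoff=0.7, attr_cutoff=0.3):
    """Blend latents by denoising progress (illustrative, not the paper's code).

    Early steps (t/T >= shape_cutoff): inject the source's shape prior.
    Late steps (t/T <= attr_cutoff): blend in reference attribute details.
    Middle steps: leave the edited latent untouched.
    """
    progress = t / T  # ~1.0 at the start of denoising, ~0.0 at the end
    if progress >= shape_cutoff:
        return source_latent                             # shape injection (early)
    if progress <= attr_cutoff:
        return 0.5 * edited_latent + 0.5 * ref_latent    # attribute injection (late)
    return edited_latent                                 # free editing in between
```

In practice the cutoffs and blend weights would be tuned per task; the point is that structure is fixed while the latent is still coarse, and reference attributes are imposed only once global layout has stabilized.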
📝 Abstract
We propose a diffusion-based framework for zero-shot image editing that unifies text-guided and reference-guided approaches without requiring fine-tuning. Our method leverages diffusion inversion and timestep-specific null-text embeddings to preserve the structural integrity of the source image. By introducing a stage-wise latent injection strategy, with shape injection in early steps and attribute injection in later steps, we enable precise, fine-grained modifications while maintaining global consistency. Cross-attention with reference latents facilitates semantic alignment between the source and reference. Extensive experiments across expression transfer, texture transformation, and style infusion demonstrate state-of-the-art performance, confirming the method's scalability and adaptability to diverse image editing scenarios.
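The cross-attention with reference latents mentioned above can be sketched as ordinary scaled dot-product attention in which source features act as queries and reference features supply keys and values. This is a generic sketch under assumed shapes; the learned projection matrices of a real attention layer are omitted, and none of the names come from the paper:

```python
import numpy as np

def reference_cross_attention(source_feats, ref_feats):
    """Attend from source features to reference features (illustrative sketch).

    source_feats: (n, d) array used as queries.
    ref_feats:    (m, d) array used as both keys and values.
    Returns (n, d) reference-aligned features for injection into the edit path.
    """
    d = source_feats.shape[-1]
    scores = source_feats @ ref_feats.T / np.sqrt(d)  # (n, m) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over reference tokens
    return weights @ ref_feats                        # convex combination of refs
```

Each source token thus receives a convex combination of reference features weighted by semantic similarity, which is what enables fine-grained alignment between corresponding regions of the source and reference.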