🤖 AI Summary
This paper addresses the challenge of jointly preserving structural integrity and enabling unified multimodal guidance (textual and reference-based) in zero-shot image editing. Methodologically: (i) it leverages the diffusion inversion process to extract structural priors from the source image and introduces a timestep-adaptive null-text embedding to mitigate semantic drift; (ii) it proposes a staged latent-space injection strategy, injecting shape priors early and attribute details late in the denoising process; and (iii) it designs a reference-feature-driven cross-attention mechanism to achieve fine-grained semantic alignment. Evaluated on facial expression transfer, texture transformation, and style injection, the method achieves state-of-the-art performance, significantly improving editing diversity, structural fidelity, and cross-task generalization. The authors claim it is the first approach to seamlessly unify textual and reference-based guidance within a zero-shot diffusion framework without fine-tuning.
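The staged latent-space injection in point (ii) can be illustrated with a minimal sketch. The function, thresholds, and blending weights below are hypothetical and not taken from the paper; it only shows the idea of switching the injected signal by denoising progress (shape priors early, attribute details late):

```python
import numpy as np

def staged_injection(edited_latent, source_latent, ref_latent, t, T,
                     shape_cutoff=0.7, attr_cutoff=0.3):
    """Blend latents by denoising progress (illustrative, not the paper's code).

    Early steps (t/T >= shape_cutoff): inject the source's shape prior.
    Late steps (t/T <= attr_cutoff): blend in reference attribute details.
    Middle steps: leave the edited latent untouched.
    """
    progress = t / T  # ~1.0 at the start of denoising, ~0.0 at the end
    if progress >= shape_cutoff:
        return source_latent                             # shape injection (early)
    if progress <= attr_cutoff:
        return 0.5 * edited_latent + 0.5 * ref_latent    # attribute injection (late)
    return edited_latent                                 # free editing in between
```

In practice the cutoffs and blend weights would be tuned per task; the point is that structure is fixed while the latent is still coarse, and reference attributes are imposed only once global layout has stabilized.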
📝 Abstract
We propose a diffusion-based framework for zero-shot image editing that unifies text-guided and reference-guided approaches without requiring fine-tuning. Our method leverages diffusion inversion and timestep-specific null-text embeddings to preserve the structural integrity of the source image. By introducing a stage-wise latent injection strategy, with shape injection in early steps and attribute injection in later steps, we enable precise, fine-grained modifications while maintaining global consistency. Cross-attention with reference latents facilitates semantic alignment between the source and reference. Extensive experiments across expression transfer, texture transformation, and style infusion demonstrate state-of-the-art performance, confirming the method's scalability and adaptability to diverse image editing scenarios.
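The cross-attention with reference latents mentioned above can be sketched as ordinary scaled dot-product attention in which source features act as queries and reference features supply keys and values. This is a generic sketch under assumed shapes; the learned projection matrices of a real attention layer are omitted, and none of the names come from the paper:

```python
import numpy as np

def reference_cross_attention(source_feats, ref_feats):
    """Attend from source features to reference features (illustrative sketch).

    source_feats: (n, d) array used as queries.
    ref_feats:    (m, d) array used as both keys and values.
    Returns (n, d) reference-aligned features for injection into the edit path.
    """
    d = source_feats.shape[-1]
    scores = source_feats @ ref_feats.T / np.sqrt(d)  # (n, m) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over reference tokens
    return weights @ ref_feats                        # convex combination of refs
```

Each source token thus receives a convex combination of reference features weighted by semantic similarity, which is what enables fine-grained alignment between corresponding regions of the source and reference.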