🤖 AI Summary
Existing text-guided diffusion models struggle to simultaneously preserve identity fidelity and structural consistency in real-face editing. This paper proposes an ID-attribute disentangled inversion framework that enables zero-shot, purely text-driven multi-attribute face editing without any training. Methodologically, it employs joint conditional inversion to disentangle the latent space into identity-specific and appearance-attribute features, and introduces a reverse-diffusion mechanism to independently control both components; during generation, the disentangled representations collaboratively guide the diffusion process. Experiments demonstrate significant improvements over baselines in identity preservation (ID Similarity +12.3%), structural stability (LPIPS −0.18), and editing accuracy, with inference speed comparable to DDIM. The core contribution is the first zero-shot framework achieving complete disentanglement and independent control of identity and attributes, establishing an efficient, general-purpose paradigm for controllable face editing.
📝 Abstract
Recent advancements in text-guided diffusion models have shown promise for general image editing via inversion techniques, but often struggle to maintain ID and structural consistency in real face editing tasks. To address this limitation, we propose a zero-shot face editing method based on ID-Attribute Decoupled Inversion. Specifically, we decompose the face representation into ID and attribute features, using them as joint conditions to guide both the inversion and the reverse diffusion processes. This allows independent control over ID and attributes, ensuring strong ID preservation and structural consistency while enabling precise facial attribute manipulation. Our method supports a wide range of complex multi-attribute face editing tasks using only text prompts, without requiring region-specific input, and operates at a speed comparable to DDIM inversion. Comprehensive experiments demonstrate its practicality and effectiveness.