🤖 AI Summary
Existing diffusion models struggle with disentangled control and cross-attribute consistency in fine-grained facial attribute editing (e.g., pose, expression, illumination). This paper introduces RigFace, the first framework to deeply integrate a coarse-grained 3D face model with Stable Diffusion. We design a spatial attribute encoder and an identity encoder operating in tandem to inject 3D-aware conditioning into the UNet denoising process; feature modulation enables multi-attribute disentanglement while strongly preserving identity. RigFace achieves state-of-the-art performance in identity fidelity and perceptual realism, supporting high-precision, independently controllable editing of pose, expression, and illumination. Quantitative and qualitative evaluations demonstrate significant improvements in editing consistency and controllability over prior methods, without compromising visual quality or identity integrity.
📝 Abstract
Current face editing methods mainly rely on GAN-based techniques, but recent focus has shifted to diffusion-based models due to their success in image reconstruction. However, diffusion models still face challenges in manipulating fine-grained attributes and preserving consistency of attributes that should remain unchanged. To address these issues and facilitate more convenient editing of face images, we propose a novel approach that leverages the power of Stable-Diffusion models and crude 3D face models to control the lighting, facial expression and head pose of a portrait photo. We observe that this task essentially involves combinations of target background, identity and different face attributes. We aim to sufficiently disentangle the control of these factors to enable high-quality face editing. Specifically, our method, coined RigFace, contains: 1) a Spatial Attribute Encoder that provides precise and decoupled conditions of background, pose, expression and lighting; 2) an Identity Encoder that transfers identity features to the denoising UNet of a pre-trained Stable-Diffusion model; 3) an Attribute Rigger that injects those conditions into the denoising UNet. Our model achieves comparable or even superior performance in both identity preservation and photorealism compared to existing face editing models.
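To make the three-component design above concrete, here is a minimal, hypothetical sketch of the conditioning flow in NumPy. All function names, tensor shapes, and the simple additive injection are illustrative assumptions for exposition, not the authors' actual implementation, which operates inside a pre-trained Stable-Diffusion UNet.

```python
import numpy as np

rng = np.random.default_rng(0)

def spatial_attribute_encoder(background, pose_map, expression_map, lighting_map):
    """Stack the decoupled spatial conditions (background, pose, expression,
    lighting) into a single conditioning feature map. (Illustrative stand-in.)"""
    return np.concatenate([background, pose_map, expression_map, lighting_map], axis=0)

def identity_encoder(face_image):
    """Reduce the source face to a global identity embedding.
    (Toy: mean-pool over spatial dimensions.)"""
    return face_image.mean(axis=(1, 2))

def attribute_rigger(unet_features, spatial_cond, id_embedding):
    """Inject the conditions into denoising-UNet features: spatial conditions
    are added pixel-wise, identity is broadcast per channel. (Assumed fusion.)"""
    c = min(unet_features.shape[0], spatial_cond.shape[0])
    out = unet_features.copy()
    out[:c] += spatial_cond[:c]
    out += id_embedding[:out.shape[0], None, None]
    return out

# Toy tensors: channels x height x width.
h, w = 8, 8
background, pose, expr, light = (rng.normal(size=(1, h, w)) for _ in range(4))
face = rng.normal(size=(3, h, w))        # source portrait
unet_feats = rng.normal(size=(3, h, w))  # intermediate UNet activations

cond = spatial_attribute_encoder(background, pose, expr, light)
ident = identity_encoder(face)
fused = attribute_rigger(unet_feats, cond, ident)
print(fused.shape)  # (3, 8, 8)
```

The key property the sketch illustrates is disentanglement: each attribute enters as its own condition channel, so one factor (e.g. lighting) can be changed while the others, and the identity embedding, stay fixed.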