🤖 AI Summary
Existing video face reenactment methods suffer from limited shape consistency and motion controllability. To address this, we propose a framework that embeds the FLAME 3D face parametric model as motion guidance in a latent diffusion model (LDM), using depth maps, normal maps, and rendering maps derived from FLAME sequences to supply 3D expression and head pose information. A multi-layer face motion fusion module with self-attention combines identity and motion latent features in the spatial domain, and the parametric representation aligns the face identity of the reference image with the motion captured from the driving video. On benchmark datasets, the method generates high-quality face animations with precise expression and head pose control, and it generalizes well to out-of-domain images. The source code is publicly available.
📝 Abstract
In this paper, we propose a method for video face reenactment that integrates a 3D face parametric model into a latent diffusion framework, with the aim of improving on the shape consistency and motion control of existing video-based face generation approaches. Our approach employs the FLAME (Faces Learned with an Articulated Model and Expressions) model as the 3D face parametric representation, providing a unified framework for modeling facial expressions and head pose. This enables precise extraction of detailed face geometry and motion features from driving videos. Specifically, we enhance the latent diffusion model with rich 3D expression and detailed pose information by incorporating depth maps, normal maps, and rendering maps derived from FLAME sequences. A multi-layer face motion fusion module with integrated self-attention mechanisms combines identity and motion latent features within the spatial domain. By using the 3D face parametric model as motion guidance, our method enables parametric alignment of face identity between the reference image and the motion captured from the driving video. Experimental results on benchmark datasets show that our method generates high-quality face animations with precise modeling of expression and head pose variation. It also generalizes well to out-of-domain images. Code is publicly available at https://github.com/weimengting/MagicPortrait.
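As a rough illustration of the fusion idea described in the abstract, the pure-Python sketch below applies unparameterized scaled dot-product self-attention to each guidance feature map (depth, normal, rendering), sums the results into a single motion feature, and adds that to the identity features. This is not the authors' implementation. The function names (`self_attention`, `fuse_guidance`, `add_motion_to_identity`), the summation step, the additive combination, and the absence of learned projections are all assumptions made for illustration; the actual module is in the linked repository.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention over a list of feature vectors.

    Assumption: queries, keys, and values are the tokens themselves
    (no learned projections), which keeps the sketch dependency-free.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        out.append(
            [sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)]
        )
    return out


def fuse_guidance(depth, normal, render):
    """Attend within each guidance map, then sum the maps element-wise.

    Assumption: element-wise summation is one simple way to merge the
    three conditions; the paper's module may merge them differently.
    Each input is a list of spatial tokens with the same shape.
    """
    attended = [self_attention(feat) for feat in (depth, normal, render)]
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*attended)]


def add_motion_to_identity(identity, motion):
    """Hypothetical combination step: add motion features to identity features."""
    return [[i + m for i, m in zip(it, mt)] for it, mt in zip(identity, motion)]
```

Each argument is a flattened feature map, that is, a list of spatial tokens where every token is a list of channel values. In practice these would come from encoders applied to the FLAME-derived depth, normal, and rendering maps, and the attention layers would carry learned weights.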