🤖 AI Summary
This work addresses controllable animation generation and novel-view synthesis from a single portrait image. Methodologically: (1) it extracts facial expression latents with a pre-trained image encoder, implicitly disentangling expression from identity and head pose in the driving video; (2) it designs a diffusion transformer with an expression-controller injection mechanism to enable fine-grained motion transfer; and (3) it combines Plücker ray maps for camera viewpoint control with normal maps rendered from 3D body mesh tracking for head pose control, yielding geometry-consistent generation. The framework achieves state-of-the-art performance along four evaluation axes: realism, expressiveness, control accuracy, and view consistency. Notably, this is the first method enabling explicit, independent control over expression, head pose, and camera viewpoint from a single input image while maintaining cross-view geometric consistency throughout dynamic animation sequences.
📝 Abstract
We introduce FactorPortrait, a video diffusion method for controllable portrait animation that enables lifelike synthesis from disentangled control signals for facial expressions, head movements, and camera viewpoints. Given a single portrait image, a driving video, and camera trajectories, our method animates the portrait by transferring facial expressions and head movements from the driving video while simultaneously enabling novel view synthesis from arbitrary viewpoints. We use a pre-trained image encoder to extract facial expression latents from the driving video as control signals for animation generation. These latents implicitly capture nuanced facial expression dynamics with identity and pose information disentangled, and they are efficiently injected into the video diffusion transformer through our proposed expression controller. For camera and head pose control, we employ Plücker ray maps and normal maps rendered from 3D body mesh tracking. To train our model, we curate a large-scale synthetic dataset containing diverse combinations of camera viewpoints, head poses, and facial expression dynamics. Extensive experiments demonstrate that our method outperforms existing approaches in realism, expressiveness, control accuracy, and view consistency.
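To make the expression-controller injection concrete, below is a minimal sketch of one plausible design: per-frame expression latents from the pre-trained encoder conditioning a diffusion transformer block via cross-attention. The abstract does not specify the controller's internals, so the block structure, dimensions, and names (`ExpressionControlledBlock`, `expr_proj`) are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: injecting per-frame expression latents into a DiT block via
# cross-attention. Layer layout and dimensions are assumptions for illustration.
import torch
import torch.nn as nn

class ExpressionControlledBlock(nn.Module):
    def __init__(self, dim=1024, expr_dim=512, heads=16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.expr_proj = nn.Linear(expr_dim, dim)  # map expression latents to token width
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, expr):
        # x: (B, N, dim) video tokens; expr: (B, T, expr_dim) per-frame expression latents.
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        e = self.expr_proj(expr)                    # expression tokens as keys/values
        h = self.norm2(x)
        x = x + self.cross_attn(h, e, e, need_weights=False)[0]  # inject expression control
        return x + self.mlp(self.norm3(x))
```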
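The Plücker ray map used for camera control has a standard construction: each pixel stores its viewing ray as a (direction, moment) pair, which encodes camera pose in a dense, resolution-aligned format. A minimal sketch follows, assuming pinhole intrinsics `K` and a world-to-camera pose `[R | t]`; the 6-channel layout is the common convention, not necessarily the paper's exact one.

```python
# Sketch of rasterizing a Plücker ray map from camera parameters (assumptions:
# pinhole model, world-to-camera [R | t], pixel-centered sampling).
import numpy as np

def plucker_ray_map(K, R, t, height, width):
    """Return an (H, W, 6) map of per-pixel ray direction d and moment o x d."""
    o = -R.T @ t  # camera center in world coordinates
    # Pixel grid in homogeneous image coordinates, sampled at pixel centers.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)        # (H, W, 3)
    # Back-project each pixel to a camera ray, then rotate into the world frame.
    d = pix @ np.linalg.inv(K).T @ R                        # (H, W, 3), rows give R^T K^-1 p
    d /= np.linalg.norm(d, axis=-1, keepdims=True)
    # Plücker moment: cross product of the camera center with each direction.
    m = np.cross(np.broadcast_to(o, d.shape), d)            # (H, W, 3)
    return np.concatenate([d, m], axis=-1)                  # (H, W, 6)
```

Since the moment m = o × d is unchanged when o slides along the ray, the pair (d, m) identifies the ray itself rather than any particular camera-center parameterization, which is what makes it a convenient per-pixel pose encoding.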