🤖 AI Summary
This work addresses controllable animation generation and novel-view synthesis from a single portrait image. Methodologically: (1) it extracts facial expression latents with a pre-trained image encoder, implicitly disentangling expression from identity and head pose in the driving video; (2) it designs a diffusion transformer with an expression-controller injection mechanism to enable fine-grained motion transfer; and (3) it combines Plücker ray maps for camera viewpoint control with normal maps rendered from 3D body mesh tracking for head pose control, yielding geometry-consistent generation. The framework achieves state-of-the-art performance along four evaluation axes: realism, expressiveness, control accuracy, and view consistency. Notably, this is the first method enabling explicit, independent control over expression, head pose, and camera viewpoint from a single input image while maintaining cross-view geometric consistency throughout dynamic animation sequences.
📝 Abstract
We introduce FactorPortrait, a video diffusion method for controllable portrait animation that enables lifelike synthesis from disentangled control signals for facial expressions, head movements, and camera viewpoints. Given a single portrait image, a driving video, and camera trajectories, our method animates the portrait by transferring facial expressions and head movements from the driving video while simultaneously enabling novel view synthesis from arbitrary viewpoints. We use a pre-trained image encoder to extract facial expression latents from the driving video as control signals for animation generation. These latents implicitly capture nuanced facial expression dynamics with identity and pose information disentangled, and they are efficiently injected into the video diffusion transformer through our proposed expression controller. For camera and head pose control, we employ Plücker ray maps and normal maps rendered from 3D body mesh tracking. To train our model, we curate a large-scale synthetic dataset containing diverse combinations of camera viewpoints, head poses, and facial expression dynamics. Extensive experiments demonstrate that our method outperforms existing approaches in realism, expressiveness, control accuracy, and view consistency.
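To make the expression-controller injection concrete, below is a minimal sketch of one plausible design: per-frame expression latents from the pre-trained encoder conditioning a diffusion transformer block via cross-attention. The abstract does not specify the controller's internals, so the block structure, dimensions, and names (`ExpressionControlledBlock`, `expr_proj`) are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: injecting per-frame expression latents into a DiT block via
# cross-attention. Layer layout and dimensions are assumptions for illustration.
import torch
import torch.nn as nn

class ExpressionControlledBlock(nn.Module):
    def __init__(self, dim=1024, expr_dim=512, heads=16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.expr_proj = nn.Linear(expr_dim, dim)  # map expression latents to token width
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, expr):
        # x: (B, N, dim) video tokens; expr: (B, T, expr_dim) per-frame expression latents.
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        e = self.expr_proj(expr)                    # expression tokens as keys/values
        h = self.norm2(x)
        x = x + self.cross_attn(h, e, e, need_weights=False)[0]  # inject expression control
        return x + self.mlp(self.norm3(x))
```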
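The Plücker ray map used for camera control has a standard construction: each pixel stores its viewing ray as a (direction, moment) pair, which encodes camera pose in a dense, resolution-aligned format. A minimal sketch follows, assuming pinhole intrinsics `K` and a world-to-camera pose `[R | t]`; the 6-channel layout is the common convention, not necessarily the paper's exact one.

```python
# Sketch of rasterizing a Plücker ray map from camera parameters (assumptions:
# pinhole model, world-to-camera [R | t], pixel-centered sampling).
import numpy as np

def plucker_ray_map(K, R, t, height, width):
    """Return an (H, W, 6) map of per-pixel ray direction d and moment o x d."""
    o = -R.T @ t  # camera center in world coordinates
    # Pixel grid in homogeneous image coordinates, sampled at pixel centers.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)        # (H, W, 3)
    # Back-project each pixel to a camera ray, then rotate into the world frame.
    d = pix @ np.linalg.inv(K).T @ R                        # (H, W, 3), rows give R^T K^-1 p
    d /= np.linalg.norm(d, axis=-1, keepdims=True)
    # Plücker moment: cross product of the camera center with each direction.
    m = np.cross(np.broadcast_to(o, d.shape), d)            # (H, W, 3)
    return np.concatenate([d, m], axis=-1)                  # (H, W, 6)
```

Since the moment m = o × d is unchanged when o slides along the ray, the pair (d, m) identifies the ray itself rather than any particular camera-center parameterization, which is what makes it a convenient per-pixel pose encoding.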