PortraitDirector: A Hierarchical Disentanglement Framework for Controllable and Real-time Facial Reenactment

📅 2026-04-21

🤖 AI Summary
Existing facial reenactment methods struggle to achieve high expressiveness and fine-grained controllability at the same time. This work proposes a hierarchical disentanglement and composition framework that decomposes facial motion into physical dynamics (head pose and local expressions) and emotional content. An emotion-filtering module strips affective cues from the local expressions, so they can be controlled independently of emotion. For speed, the approach combines diffusion distillation, causal attention, and VAE acceleration, and it injects head pose through a dedicated representation pathway. The result is streaming, high-fidelity, controllable facial animation at 512×512 resolution and 20 FPS, with an end-to-end latency of 800 ms on a single RTX 5090 GPU.
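To put the reported throughput and latency figures side by side, the short calculation below converts them into a per-frame time budget and an equivalent number of frame intervals. It is plain arithmetic on the two numbers quoted above; the function names are hypothetical, and the summary does not say how the 800 ms is actually split across pipeline stages.

```python
def frame_budget_ms(fps: float) -> float:
    """Average time available per output frame at a given frame rate."""
    return 1000.0 / fps


def latency_in_frames(latency_ms: float, fps: float) -> float:
    """End-to-end latency expressed as a count of frame intervals."""
    return latency_ms / frame_budget_ms(fps)


# Figures quoted in the summary: 20 FPS and 800 ms end-to-end latency.
budget = frame_budget_ms(20)       # 50.0 ms per frame
lag = latency_in_frames(800, 20)   # 16.0 frame intervals

print(f"per-frame budget: {budget} ms, latency: {lag} frame intervals")
```

In other words, sustaining 20 FPS means producing a frame every 50 ms on average, while the output trails the driving input by the equivalent of 16 frame intervals.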


📝 Abstract
Existing facial reenactment methods struggle with a trade-off between expressiveness and fine-grained controllability. Holistic facial reenactment models often sacrifice granular control for expressiveness, while methods designed for control may struggle with fidelity and robust disentanglement. Instead of treating facial motion as a monolithic signal, we explore an alternative compositional perspective. In this paper, we introduce PortraitDirector, a novel framework that formulates face reenactment as a hierarchical composition task, achieving high-fidelity and controllable results. We employ a Hierarchical Motion Disentanglement and Composition strategy, deconstructing facial motion into a Spatial Layer for physical movements and a Semantic Layer for emotional content. The Spatial Layer comprises (i) global head pose, managed via a dedicated representation and injection pathway, and (ii) spatially separated local facial expressions, distilled from cropped facial regions and purged of emotional cues via an Emotion-Filtering Module that leverages an information bottleneck. The Semantic Layer contains a derived global emotion. The disentangled components are then recomposed into an expressive motion latent. Furthermore, we engineer the framework for real-time performance through a suite of optimizations, including diffusion distillation, causal attention, and VAE acceleration. PortraitDirector achieves streaming, high-fidelity, controllable 512×512 face reenactment at 20 FPS with an end-to-end latency of 800 ms on a single RTX 5090 GPU.
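The abstract names causal attention as one of the optimizations that make streaming possible. As a generic illustration of that idea, and not the paper's architecture, the NumPy sketch below implements single-head causal self-attention over a sequence of per-frame feature vectors. The function name, tensor shapes, and single-head layout are assumptions made for the example; the abstract does not describe the actual attention configuration.

```python
import numpy as np


def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v: arrays of shape (T, d), one row per frame (hypothetical layout).
    Returns the attended values (T, d) and the attention weights (T, T).
    """
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)

    # Mask out strictly-future positions so frame t only sees frames <= t.
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Numerically stable softmax; the diagonal is never masked, so every
    # row has at least one finite score.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
q = rng.normal(size=(5, 4))
k = rng.normal(size=(5, 4))
v = rng.normal(size=(5, 4))
out, w = causal_attention(q, k, v)

# Perturbing the last frame leaves all earlier outputs unchanged.
k2, v2 = k.copy(), v.copy()
k2[-1] += 1.0
v2[-1] += 1.0
out2, _ = causal_attention(q, k2, v2)
print(np.allclose(out[:-1], out2[:-1]))
```

Because each frame's output depends only on the current and earlier frames, frames can be emitted as they arrive rather than after the whole clip is available, which is the property a streaming pipeline relies on.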
Problem

Research questions and friction points this paper is trying to address.

facial reenactment
controllability
expressiveness
disentanglement
real-time performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Disentanglement
Facial Reenactment
Emotion-Filtering Module
Real-time Performance
Motion Composition