🤖 AI Summary
Existing audio-driven talking-head synthesis methods achieve notable gains in naturalness in zero-shot settings, yet they lack fine-grained control over semantic dimensions such as head pose and facial expression, which limits their practical deployment in film production and e-commerce live streaming. To address this, we propose a conditional diffusion-based framework for controllable portrait video generation. Our approach is the first to embed disentangled 3D head pose and expression parameters, including explicit modeling of rotation, blinking, and other semantic controls, directly into an end-to-end diffusion process. By jointly optimizing 3D parameter estimation, audio feature encoding, and conditional diffusion modeling, we unify naturalness with precise controllability. Extensive experiments demonstrate that our method consistently outperforms state-of-the-art approaches in both perceptual quality and multi-granularity controllability. Moreover, it supports real-time editing and seamless integration with diverse rendering frameworks.
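The summary describes conditioning a diffusion model on audio features together with disentangled 3D pose and expression parameters. The sketch below is a minimal, hypothetical illustration of that conditioning scheme (it is not the authors' code): separate encoders for the audio, pose, and expression streams feed a denoiser that predicts the noise residual for a per-frame latent. All module names, dimensions, and the simplified noise schedule are assumptions made for illustration.

```python
# Hypothetical sketch: conditioning a diffusion denoiser on audio features plus
# disentangled 3D head pose and expression parameters. Names/dims are illustrative.
import torch
import torch.nn as nn


class ConditionalDenoiser(nn.Module):
    def __init__(self, latent_dim=64, audio_dim=128, pose_dim=6, expr_dim=64, hidden=256):
        super().__init__()
        # Separate encoders keep pose and expression disentangled as conditions.
        self.audio_enc = nn.Linear(audio_dim, hidden)
        self.pose_enc = nn.Linear(pose_dim, hidden)    # e.g. yaw/pitch/roll + translation
        self.expr_enc = nn.Linear(expr_dim, hidden)    # e.g. blendshapes incl. eye blink
        self.time_emb = nn.Sequential(nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, hidden))
        self.backbone = nn.Sequential(
            nn.Linear(latent_dim + 3 * hidden + hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, latent_dim),             # predicts the added noise
        )

    def forward(self, noisy_latent, t, audio_feat, pose, expr):
        cond = torch.cat(
            [self.audio_enc(audio_feat), self.pose_enc(pose), self.expr_enc(expr)], dim=-1
        )
        t_emb = self.time_emb(t.float().unsqueeze(-1))
        return self.backbone(torch.cat([noisy_latent, cond, t_emb], dim=-1))


# One toy denoising training step: predict the noise given the conditions.
model = ConditionalDenoiser()
x0 = torch.randn(4, 64)                    # clean per-frame latent (placeholder)
noise = torch.randn_like(x0)
t = torch.randint(0, 1000, (4,))
alpha_bar = torch.rand(4, 1)               # stand-in for a real noise schedule
noisy = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * noise
pred = model(noisy, t, torch.randn(4, 128), torch.randn(4, 6), torch.randn(4, 64))
loss = nn.functional.mse_loss(pred, noise)
```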
📝 Abstract
Diffusion-based video generation techniques have significantly improved zero-shot talking-head avatar generation, enhancing the naturalness of both head motion and facial expressions. However, existing methods suffer from poor controllability, making them less applicable to real-world scenarios such as filmmaking and live streaming for e-commerce. To address this limitation, we propose FLAP, a novel approach that integrates explicit 3D intermediate parameters (head poses and facial expressions) into the diffusion model for end-to-end generation of realistic portrait videos. The proposed architecture allows the model to generate vivid portrait videos from audio while simultaneously incorporating additional control signals, such as head rotation angles and eye-blinking frequency. Furthermore, decoupling head pose from facial expression allows each to be controlled independently, offering precise manipulation of both the avatar's pose and its expressions. We also demonstrate its flexibility in integrating with existing 3D head generation methods, bridging the gap between 3D model-based approaches and end-to-end diffusion techniques. Extensive experiments show that our method outperforms recent audio-driven portrait video models in both naturalness and controllability.
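Because the abstract emphasizes that pose and expression are decoupled, inference-time edits can in principle target one stream without disturbing the other. The snippet below is a hypothetical usage sketch of such control signals (not FLAP's actual interface): it overrides the head rotation trajectory and the blink timing while leaving the remaining audio-driven expression coefficients untouched. The frame count, parameter layout, and `BLINK_IDX` are assumptions for illustration.

```python
# Hypothetical inference-time control: pose and expression streams are supplied
# independently, so either can be edited without touching the other.
import torch

frames, pose_dim, expr_dim = 75, 6, 64           # 3 s at 25 fps (illustrative)

# Pose/expression as a model might predict them from audio (placeholders here).
pose = torch.zeros(frames, pose_dim)
expr = torch.randn(frames, expr_dim)

# Override head rotation: slow yaw sweep from -20 to +20 degrees.
pose[:, 0] = torch.linspace(-20.0, 20.0, frames)

# Override blinking: close the (assumed) blink coefficient once every 25 frames.
BLINK_IDX = 10                                   # assumed index of the blink coefficient
for start in range(0, frames, 25):
    expr[start:start + 3, BLINK_IDX] = 1.0       # brief eye closure over 3 frames

# The edited streams would then condition the diffusion sampler, e.g.:
# video = sampler(reference_image, audio_feat, pose, expr)
```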