🤖 AI Summary
Existing audio-driven talking-head synthesis methods achieve notable gains in naturalness in zero-shot settings, yet they lack fine-grained control over semantic dimensions such as head pose and facial expression, which limits their practical deployment in film production and e-commerce live streaming. To address this, we propose a conditional diffusion-based framework for controllable portrait video generation. Our approach is the first to embed disentangled 3D head pose and expression parameters, including explicit modeling of rotation, blinking, and other semantic controls, directly into an end-to-end diffusion process. By jointly optimizing 3D parameter estimation, audio feature encoding, and conditional diffusion modeling, we unify naturalness with precise controllability. Extensive experiments demonstrate that our method consistently outperforms state-of-the-art approaches in both perceptual quality and multi-granularity controllability. Moreover, it supports real-time editing and seamless integration with diverse rendering frameworks.
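The summary describes conditioning a diffusion model on audio features together with disentangled 3D pose and expression parameters. The sketch below is a minimal, hypothetical illustration of that conditioning scheme (it is not the authors' code): separate encoders for the audio, pose, and expression streams feed a denoiser that predicts the noise residual for a per-frame latent. All module names, dimensions, and the simplified noise schedule are assumptions made for illustration.

```python
# Hypothetical sketch: conditioning a diffusion denoiser on audio features plus
# disentangled 3D head pose and expression parameters. Names/dims are illustrative.
import torch
import torch.nn as nn


class ConditionalDenoiser(nn.Module):
    def __init__(self, latent_dim=64, audio_dim=128, pose_dim=6, expr_dim=64, hidden=256):
        super().__init__()
        # Separate encoders keep pose and expression disentangled as conditions.
        self.audio_enc = nn.Linear(audio_dim, hidden)
        self.pose_enc = nn.Linear(pose_dim, hidden)    # e.g. yaw/pitch/roll + translation
        self.expr_enc = nn.Linear(expr_dim, hidden)    # e.g. blendshapes incl. eye blink
        self.time_emb = nn.Sequential(nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, hidden))
        self.backbone = nn.Sequential(
            nn.Linear(latent_dim + 3 * hidden + hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, latent_dim),             # predicts the added noise
        )

    def forward(self, noisy_latent, t, audio_feat, pose, expr):
        cond = torch.cat(
            [self.audio_enc(audio_feat), self.pose_enc(pose), self.expr_enc(expr)], dim=-1
        )
        t_emb = self.time_emb(t.float().unsqueeze(-1))
        return self.backbone(torch.cat([noisy_latent, cond, t_emb], dim=-1))


# One toy denoising training step: predict the noise given the conditions.
model = ConditionalDenoiser()
x0 = torch.randn(4, 64)                    # clean per-frame latent (placeholder)
noise = torch.randn_like(x0)
t = torch.randint(0, 1000, (4,))
alpha_bar = torch.rand(4, 1)               # stand-in for a real noise schedule
noisy = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * noise
pred = model(noisy, t, torch.randn(4, 128), torch.randn(4, 6), torch.randn(4, 64))
loss = nn.functional.mse_loss(pred, noise)
```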
📝 Abstract
Diffusion-based video generation techniques have significantly improved zero-shot talking-head avatar generation, enhancing the naturalness of both head motion and facial expressions. However, existing methods suffer from poor controllability, making them less applicable to real-world scenarios such as filmmaking and live streaming for e-commerce. To address this limitation, we propose FLAP, a novel approach that integrates explicit 3D intermediate parameters (head poses and facial expressions) into the diffusion model for end-to-end generation of realistic portrait videos. The proposed architecture allows the model to generate vivid portrait videos from audio while simultaneously incorporating additional control signals, such as head rotation angles and eye-blinking frequency. Furthermore, decoupling head pose from facial expression allows each to be controlled independently, offering precise manipulation of both the avatar's pose and its expressions. We also demonstrate its flexibility in integrating with existing 3D head generation methods, bridging the gap between 3D model-based approaches and end-to-end diffusion techniques. Extensive experiments show that our method outperforms recent audio-driven portrait video models in both naturalness and controllability.
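Because the abstract emphasizes that pose and expression are decoupled, inference-time edits can in principle target one stream without disturbing the other. The snippet below is a hypothetical usage sketch of such control signals (not FLAP's actual interface): it overrides the head rotation trajectory and the blink timing while leaving the remaining audio-driven expression coefficients untouched. The frame count, parameter layout, and `BLINK_IDX` are assumptions for illustration.

```python
# Hypothetical inference-time control: pose and expression streams are supplied
# independently, so either can be edited without touching the other.
import torch

frames, pose_dim, expr_dim = 75, 6, 64           # 3 s at 25 fps (illustrative)

# Pose/expression as a model might predict them from audio (placeholders here).
pose = torch.zeros(frames, pose_dim)
expr = torch.randn(frames, expr_dim)

# Override head rotation: slow yaw sweep from -20 to +20 degrees.
pose[:, 0] = torch.linspace(-20.0, 20.0, frames)

# Override blinking: close the (assumed) blink coefficient once every 25 frames.
BLINK_IDX = 10                                   # assumed index of the blink coefficient
for start in range(0, frames, 25):
    expr[start:start + 3, BLINK_IDX] = 1.0       # brief eye closure over 3 frames

# The edited streams would then condition the diffusion sampler, e.g.:
# video = sampler(reference_image, audio_feat, pose, expr)
```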