AI killed the video star. Audio-driven diffusion model for expressive talking head generation

📅 2025-11-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses audio-driven high-fidelity talking head generation, aiming to jointly model lip motion, facial expressions, and head pose with precise audio-visual synchronization. To this end, we propose the conditional Motion Diffusion Transformer (cMDT), the first diffusion-based framework that explicitly models 3D facial landmark sequences conditioned on audio features and guided by a reference image. cMDT leverages Transformer architectures to capture long-range temporal dependencies and employs 3D facial representations to enhance motion naturalness and audio-visual alignment. Extensive experiments on VoxCeleb2 and CelebV-HQ demonstrate state-of-the-art performance, achieving superior scores in PSNR, LPIPS, and SyncNet metrics compared to existing methods. A user study further confirms significant improvements in visual realism and expressiveness. Our contributions include: (i) the integration of 3D facial dynamics into a conditional diffusion framework; (ii) a unified multimodal motion generation paradigm leveraging audio, reference image, and 3D priors; and (iii) empirical validation of enhanced synchronization and fidelity across quantitative and perceptual evaluations.
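The page gives no implementation details beyond the description above, but the core idea, a Transformer denoiser over 3D facial landmark sequences conditioned on audio features and a reference-image embedding, can be sketched. The PyTorch snippet below is a minimal, hypothetical illustration: the class name, layer sizes, and add-the-conditions scheme (landmark_dim, audio_dim, ref_dim, d_model, and so on) are assumptions for exposition, not the authors' cMDT.

```python
# Hypothetical sketch of a cMDT-style conditional denoiser (not the paper's code).
import math
import torch
import torch.nn as nn


def timestep_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Sinusoidal embedding of diffusion timesteps t, shape (B,) -> (B, dim)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / (half - 1))
    args = t[:, None].float() * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)


class ConditionalMotionDiffusionTransformer(nn.Module):
    """Predicts the noise added to a 3D facial landmark sequence, conditioned on
    per-frame audio features and a global reference-image embedding."""

    def __init__(self, landmark_dim=68 * 3, audio_dim=768, ref_dim=512,
                 d_model=512, n_heads=8, n_layers=8):
        super().__init__()
        self.motion_in = nn.Linear(landmark_dim, d_model)   # noisy landmarks -> tokens
        self.audio_in = nn.Linear(audio_dim, d_model)       # frame-aligned audio features
        self.ref_in = nn.Linear(ref_dim, d_model)           # reference-image embedding
        self.t_proj = nn.Linear(d_model, d_model)           # timestep conditioning
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)  # long-range temporal modeling
        self.motion_out = nn.Linear(d_model, landmark_dim)

    def forward(self, noisy_motion, t, audio_feats, ref_embed):
        # noisy_motion: (B, T, landmark_dim), audio_feats: (B, T, audio_dim)
        # t: (B,) integer timesteps, ref_embed: (B, ref_dim)
        t_emb = self.t_proj(timestep_embedding(t, self.motion_in.out_features))
        tokens = self.motion_in(noisy_motion) + self.audio_in(audio_feats)
        tokens = tokens + t_emb[:, None, :] + self.ref_in(ref_embed)[:, None, :]
        return self.motion_out(self.backbone(tokens))  # predicted noise, same shape as input
```

This assumes the audio features are already resampled to one vector per video frame; how the paper aligns audio and motion is not stated on this page.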

📝 Abstract
We propose Dimitra++, a novel framework for audio-driven talking head generation, designed to jointly learn lip motion, facial expression, and head pose. Specifically, we propose a conditional Motion Diffusion Transformer (cMDT) to model facial motion sequences using a 3D representation. The cMDT is conditioned on two inputs: a reference facial image, which determines appearance, and an audio sequence, which drives the motion. Quantitative and qualitative experiments, as well as a user study on two widely employed datasets, VoxCeleb2 and CelebV-HQ, suggest that Dimitra++ outperforms existing approaches in generating realistic talking heads with convincing lip motion, facial expression, and head pose.
Problem

Research questions and friction points this paper is trying to address.

Generating expressive talking heads from audio alone
Jointly modeling lip motion, facial expression, and head pose
Achieving precise audio-visual synchronization and realism
Innovation

Methods, ideas, or system contributions that make the work stand out.

Audio-driven diffusion model for talking heads
Conditional Motion Diffusion Transformer for facial motion (training step sketched below)
3D representation for realistic lip and expression generation
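The diffusion component follows the standard denoising recipe: corrupt clean landmark sequences with Gaussian noise and train the network to recover that noise, conditioned on audio and the reference embedding. The snippet below is a sketch under common DDPM assumptions (linear beta schedule, epsilon-prediction); T_STEPS and training_step are made-up names, and the paper may well use a different schedule or parameterization.

```python
# Illustrative DDPM training step for the hypothetical denoiser sketched above.
import torch
import torch.nn.functional as F

T_STEPS = 1000                                        # assumed number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T_STEPS)           # assumed linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def training_step(model, motion, audio_feats, ref_embed):
    """One epsilon-prediction DDPM step on clean landmark sequences (B, T, D)."""
    t = torch.randint(0, T_STEPS, (motion.size(0),), device=motion.device)
    eps = torch.randn_like(motion)
    a_bar = alphas_cumprod.to(motion.device)[t][:, None, None]
    noisy = a_bar.sqrt() * motion + (1.0 - a_bar).sqrt() * eps   # forward diffusion
    eps_hat = model(noisy, t, audio_feats, ref_embed)            # denoiser predicts the noise
    return F.mse_loss(eps_hat, eps)
```

At inference time the same model would be applied iteratively from pure noise, then the generated landmark sequence rendered into video frames using the reference image; the rendering stage is not described on this page.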
🔎 Similar Papers
No similar papers found.
Baptiste Chopin
Centre INRIA d’Université Côte d’Azur, France.
Tashvik Dhamija
Centre INRIA d’Université Côte d’Azur, France.
Pranav Balaji
Centre INRIA d’Université Côte d’Azur, France.
Yaohui Wang
Research Scientist, Shanghai AI Laboratory | Inria
Machine Learning · Deep Generative Models · Video Generation
Antitza Dantcheva
Directrice de Recherche, Inria, France
Video generation · Deepfake generation and detection · Face analysis for health monitoring and …