🤖 AI Summary
3D talking-head generation faces two key challenges: difficulty in jointly modeling audio–facial dynamics and the lack of high-quality, emotion-aware 3D facial datasets. To address these, we introduce EmoVOCA—the first synthetic dataset specifically designed for emotive 3D talking heads—built by disentangling neutral 3D face geometry from controllable emotional motion sequences, enabling continuous control over both emotion categories and intensities. We further propose an end-to-end conditional diffusion framework that operates beyond conventional 3D Morphable Model (3DMM) parameter spaces, enabling audio-driven joint synthesis of high-fidelity lip synchronization and fine-grained emotional expressions. Quantitative evaluation on benchmarks (e.g., LMD, FDD) and comprehensive user studies demonstrate significant improvements over state-of-the-art methods. All code, pre-trained models, and the EmoVOCA dataset are publicly released, supporting real-time, multi-emotion, and multi-intensity 3D avatar generation.
📝 Abstract
A notable challenge in 3D talking head generation lies in blending speech-related motions with expression dynamics. This is primarily caused by the lack of comprehensive 3D datasets that combine diversity in spoken sentences with a variety of facial expressions. Some works in the literature attempted to overcome this lack of data by fitting parametric 3D models (3DMMs) to 2D videos and using the reconstructed 3D faces as a replacement. However, their underlying parametric space limits the precision required to accurately reproduce convincing lip motion and synchronization, which is crucial for the application at hand. In this work, we approach the problem from a different perspective and develop a data-driven technique to combine inexpressive 3D talking heads with a set of 3D expressive sequences, which we use to create a synthetic dataset called EmoVOCA. We then design and train an emotional 3D talking head generator that accepts a 3D face, an audio file, an emotion label, and an intensity value as inputs, and learns to animate the audio-synchronized lip movements with expressive traits of the face. Comprehensive experiments, both quantitative and qualitative, using our data and generator demonstrate a superior ability to synthesize convincing animations compared with the best-performing methods in the literature. Our code and pre-trained models are available at https://github.com/miccunifi/EmoVOCA.
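To make the dataset-construction idea concrete, here is a minimal sketch of combining an inexpressive (speech-only) displacement sequence with an expressive one via intensity-scaled per-vertex offsets. The additive rule, the function name, and the toy data are illustrative assumptions for exposition, not the authors' actual method.

```python
# Hypothetical sketch: blend speech-driven per-vertex displacements with
# expression-driven ones, scaled by an emotion intensity in [0, 1].
# The additive composition rule below is an assumption for illustration.

def combine_displacements(speech_disp, expr_disp, intensity):
    """Blend speech-only and expressive per-vertex offsets additively."""
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must be in [0, 1]")
    return [s + intensity * e for s, e in zip(speech_disp, expr_disp)]

# Example: three vertex offsets along one axis (toy values).
speech = [0.02, -0.01, 0.00]   # lip-sync motion on a neutral face
happy  = [0.00,  0.03, 0.05]   # expressive offsets for "happy"
blended = combine_displacements(speech, happy, intensity=0.5)
```

Varying the intensity argument continuously would yield the multi-intensity sequences the synthetic dataset is built from, with `intensity=0.0` recovering the purely speech-driven animation.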