EmoVOCA: Speech-Driven Emotional 3D Talking Heads

📅 2024-03-19
🏛️ IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)
📈 Citations: 4
Influential: 0
🤖 AI Summary
3D talking-head generation faces two key challenges: the difficulty of blending speech-driven lip motion with expression dynamics, and the lack of comprehensive, emotion-aware 3D facial datasets. Prior work sidesteps the data problem by fitting 3D Morphable Models (3DMMs) to 2D videos, but the parametric space limits the precision needed for convincing lip motion and synchronization. This paper instead introduces EmoVOCA, a synthetic dataset built with a data-driven technique that combines inexpressive 3D talking heads with expressive 3D motion sequences. On top of it, the authors train an emotional 3D talking-head generator conditioned on a neutral 3D face, an audio file, an emotion label, and an intensity value, producing audio-synchronized lip movements with expressive facial traits and controllable emotion intensity. Quantitative evaluation on standard metrics (e.g., LMD, FDD) and qualitative user studies show improvements over state-of-the-art methods. Code and pre-trained models are publicly released.

📝 Abstract
A notable challenge in 3D talking head generation lies in blending speech-related motions with expression dynamics. This is primarily caused by the lack of comprehensive 3D datasets that combine diversity in spoken sentences with a variety of facial expressions. Some works in the literature attempted to overcome this lack of data by fitting parametric 3D models (3DMMs) to 2D videos and using the reconstructed 3D faces as a replacement. However, their underlying parametric space limits the precision required to accurately reproduce convincing lip motion and syncing, which is crucial for the application at hand. In this work, we look at the problem from a different perspective, and developed a data-driven technique to combine inexpressive 3D talking heads with a set of 3D expressive sequences, which we used to create a synthetic dataset called EmoVOCA. We then designed and trained an emotional 3D talking head generator that accepts a 3D face, an audio file, an emotion label, and an intensity value as inputs, and learns to animate the audio-synchronized lip movements with expressive traits of the face. Comprehensive experiments, both quantitative and qualitative, using our data and generator demonstrate a superior ability to synthesize convincing animations compared with the best performing methods in the literature. Our code and pre-trained models are available at https://github.com/miccunifi/EmoVOCA.
Problem

Research questions and friction points this paper is trying to address.

Blending speech motions with facial expressions in 3D talking heads
Lack of diverse 3D datasets combining speech and expressions
Joint modeling limitations in existing 2D and parametric 3D approaches
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines inexpressive 3D heads with expressive sequences
Generates synthetic dataset EmoVOCA for training
Trains an audio-synchronized emotional 3D talking head generator
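As a sketch, the generator described in the abstract can be viewed as a function of four inputs: a neutral 3D face, an audio track, an emotion label, and an intensity value, returning one deformed mesh per audio frame. The stub below illustrates that interface only; the function name, feature shapes, emotion list, and all internals are placeholder assumptions, not the paper's actual architecture.

```python
import numpy as np

# Illustrative emotion vocabulary (assumed; the paper's label set may differ).
EMOTIONS = ["neutral", "happy", "sad", "angry", "surprised", "fearful", "disgusted"]

def animate(template, audio_feats, emotion, intensity, rng=None):
    """Hypothetical interface for an emotional 3D talking head generator.

    template:    (V, 3) vertices of a neutral 3D face.
    audio_feats: (T, D) per-frame audio features (e.g., from a speech encoder).
    emotion:     label from EMOTIONS; intensity: scalar in [0, 1].
    Returns:     (T, V, 3) animated vertex positions, one mesh per frame.
    """
    assert emotion in EMOTIONS and 0.0 <= intensity <= 1.0
    rng = rng or np.random.default_rng(0)
    T, V = audio_feats.shape[0], template.shape[0]
    # Stand-in for the learned speech-to-motion decoder: maps each audio
    # frame to small per-vertex displacements (the real model learns this).
    proj = rng.standard_normal((audio_feats.shape[1], V * 3))
    speech_offsets = 0.001 * np.tanh(audio_feats @ proj).reshape(T, V, 3)
    # Stand-in for the expressive component; the intensity input scales
    # a static per-emotion deformation direction.
    emo_dir = rng.standard_normal((V, 3))
    expr_offsets = intensity * 0.001 * emo_dir * (EMOTIONS.index(emotion) / len(EMOTIONS))
    return template[None] + speech_offsets + expr_offsets[None]

# Toy call: 5023-vertex template (FLAME-like count), 30 audio frames.
frames = animate(np.zeros((5023, 3)), np.zeros((30, 64)), "happy", 0.8)
print(frames.shape)  # (30, 5023, 3)
```

The point of the sketch is the conditioning signature: speech content and emotional style enter as separate inputs, so intensity can be varied continuously without changing the spoken content.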
Federico Nocentini
Media Integration and Communication Center (MICC), University of Florence, Italy
C. Ferrari
Department of Architecture and Engineering, University of Parma, Italy
Stefano Berretti
Professor of Computer Engineering, University of Firenze, Italy
3D Computer Vision · Pattern Recognition · Biometrics · Machine Learning