🤖 AI Summary
Existing speech-driven 3D facial animation methods achieve high lip-sync accuracy but struggle to model subtle emotional cues embedded in speech, resulting in monotonous and insufficiently diverse facial expressions. To address this, we propose the first emotion-aware speech-driven 3D facial animation framework. Our method introduces three key innovations: (1) Dynamic Emotion Embedding (DEE), which explicitly models the stochastic mapping between speech and facial expressions; (2) a hierarchical temporal VQ-VAE (TH-VQVAE) enabling non-autoregressive codebook prediction and long-range temporal modeling of expressive dynamics; and (3) a joint training objective combining probabilistic contrastive learning and an emotion consistency loss to ensure cross-sample emotional coherence and discriminability. Extensive evaluations on multiple benchmarks demonstrate significant improvements: +32.7% in facial diversity (FDD) and +18.4% in emotion recognition accuracy, while maintaining state-of-the-art lip-sync precision (LSE < 1.2 mm).
📝 Abstract
Speech-driven 3D facial animation has garnered lots of attention thanks to its broad range of applications. Despite recent advancements in achieving realistic lip motion, current methods fail to capture the nuanced emotional undertones conveyed through speech and produce monotonous facial motion. These limitations result in blunt and repetitive facial animations, reducing user engagement and hindering their applicability. To address these challenges, we introduce DEEPTalk, a novel approach that generates diverse and emotionally rich 3D facial expressions directly from speech inputs. To achieve this, we first train DEE (Dynamic Emotion Embedding), which employs probabilistic contrastive learning to forge a joint emotion embedding space for both speech and facial motion. This probabilistic framework captures the uncertainty in interpreting emotions from speech and facial motion, enabling the derivation of emotion vectors from its multifaceted space. Moreover, to generate dynamic facial motion, we design TH-VQVAE (Temporally Hierarchical VQ-VAE) as an expressive and robust motion prior overcoming limitations of VAEs and VQ-VAEs. Utilizing these strong priors, we develop DEEPTalk, a talking head generator that non-autoregressively predicts codebook indices to create dynamic facial motion, incorporating a novel emotion consistency loss. Extensive experiments on various datasets demonstrate the effectiveness of our approach in creating diverse, emotionally expressive talking faces that maintain accurate lip-sync. Our project page is available at https://whwjdqls.github.io/deeptalk_website/