🤖 AI Summary
Existing speech-driven 3D facial animation methods rely on explicit identity/emotion labels (e.g., one-hot encodings), resulting in poor generalization and neglecting the implicit emotional cues embedded in speech, thereby limiting animation naturalness and cross-speaker adaptability. To address this, we propose a label-free implicit feature representation framework: a dual-branch Transformer jointly models speech temporal dynamics and identity priors from neutral facial meshes, and a Hierarchical Interaction Fusion Block (HIFB) unifies emotional, motion, and identity cues at the token level. Our approach enables end-to-end implicit disentangled learning without any manual emotion or identity annotations. Evaluated on the 3DMEAD dataset, our method achieves state-of-the-art performance in emotional expressiveness, zero-shot speaker generalization, and visual realism of synthesized animations.
📝 Abstract
Speech-driven 3D facial animation has attracted increasing interest owing to its potential to generate expressive and temporally synchronized digital humans. While recent works have begun to explore emotion-aware animation, they still depend on explicit one-hot encodings of given emotion and identity labels, which limits their ability to generalize to unseen speakers. Moreover, the emotional cues inherently present in speech are often neglected, limiting the naturalness and adaptability of generated animations. In this work, we propose LSF-Animation, a novel framework that eliminates the reliance on explicit emotion and identity feature representations. Specifically, LSF-Animation implicitly extracts emotion information from speech and captures identity features from a neutral facial mesh, enabling improved generalization to unseen speakers and emotional states without requiring manual labels. Furthermore, we introduce a Hierarchical Interaction Fusion Block (HIFB), which employs a fusion token to integrate features from the dual-branch transformer, more effectively fusing emotional, motion-related, and identity-related cues. Extensive experiments conducted on the 3DMEAD dataset demonstrate that our method surpasses recent state-of-the-art approaches in terms of emotional expressiveness, identity generalization, and animation realism. The source code will be released at: https://github.com/Dogter521/LSF-Animation.
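To make the fusion-token idea concrete, here is a minimal NumPy sketch of token-level fusion: a single learnable fusion token attends over the concatenated speech and identity token sequences via scaled dot-product attention and returns one fused summary vector. This is an illustrative assumption about the mechanism, not the paper's actual HIFB implementation; all names (`fusion_token_attend`, shapes, dimensions) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fusion_token_attend(fusion_token, speech_feats, identity_feats):
    """Hypothetical sketch: one fusion token (1, d) queries the
    concatenated speech (T1, d) and identity (T2, d) tokens and
    returns a single fused feature vector of shape (1, d)."""
    tokens = np.concatenate([speech_feats, identity_feats], axis=0)  # (T1+T2, d)
    d = tokens.shape[-1]
    scores = fusion_token @ tokens.T / np.sqrt(d)  # (1, T1+T2) attention logits
    weights = softmax(scores, axis=-1)             # normalize over all tokens
    return weights @ tokens                        # weighted sum -> (1, d)

rng = np.random.default_rng(0)
d = 8
speech = rng.standard_normal((10, d))    # per-frame speech features (assumed)
identity = rng.standard_normal((4, d))   # tokens from the neutral mesh (assumed)
token = rng.standard_normal((1, d))      # learnable fusion token (assumed)
fused = fusion_token_attend(token, speech, identity)
print(fused.shape)  # (1, 8)
```

In a trained model the fusion token and projections would be learned end-to-end; the sketch only shows how one token can aggregate heterogeneous cues without any explicit emotion or identity labels.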