🤖 AI Summary
Existing speech-driven 3D facial animation methods rely on explicit identity/emotion labels (e.g., one-hot encodings), resulting in poor generalization and neglecting the implicit emotional cues embedded in speech, thereby limiting animation naturalness and cross-speaker adaptability. To address this, we propose a label-free implicit feature representation framework: a dual-branch Transformer jointly models speech temporal dynamics and identity priors from neutral facial meshes, and a Hierarchical Interaction Fusion Block (HIFB) unifies emotional, motion, and identity cues at the token level. Our approach enables end-to-end implicit disentangled learning without any manual emotion or identity annotations. Evaluated on the 3DMEAD dataset, our method achieves state-of-the-art performance in emotional expressiveness, zero-shot speaker generalization, and visual realism of synthesized animations.
📝 Abstract
Speech-driven 3D facial animation has attracted increasing interest owing to its potential to generate expressive and temporally synchronized digital humans. While recent works have begun to explore emotion-aware animation, they still depend on explicit one-hot encodings of given emotion and identity labels, which limits their ability to generalize to unseen speakers. Moreover, the emotional cues inherently present in speech are often neglected, limiting the naturalness and adaptability of generated animations. In this work, we propose LSF-Animation, a novel framework that eliminates the reliance on explicit emotion and identity feature representations. Specifically, LSF-Animation implicitly extracts emotion information from speech and captures identity features from a neutral facial mesh, enabling improved generalization to unseen speakers and emotional states without requiring manual labels. Furthermore, we introduce a Hierarchical Interaction Fusion Block (HIFB), which employs a fusion token to integrate features from the dual-branch transformer, more effectively fusing emotional, motion-related, and identity-related cues. Extensive experiments conducted on the 3DMEAD dataset demonstrate that our method surpasses recent state-of-the-art approaches in terms of emotional expressiveness, identity generalization, and animation realism. The source code will be released at: https://github.com/Dogter521/LSF-Animation.
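To make the fusion-token idea concrete, here is a minimal NumPy sketch of token-level fusion: a single learnable fusion token attends over the concatenated speech and identity token sequences via scaled dot-product attention and returns one fused summary vector. This is an illustrative assumption about the mechanism, not the paper's actual HIFB implementation; all names (`fusion_token_attend`, shapes, dimensions) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fusion_token_attend(fusion_token, speech_feats, identity_feats):
    """Hypothetical sketch: one fusion token (1, d) queries the
    concatenated speech (T1, d) and identity (T2, d) tokens and
    returns a single fused feature vector of shape (1, d)."""
    tokens = np.concatenate([speech_feats, identity_feats], axis=0)  # (T1+T2, d)
    d = tokens.shape[-1]
    scores = fusion_token @ tokens.T / np.sqrt(d)  # (1, T1+T2) attention logits
    weights = softmax(scores, axis=-1)             # normalize over all tokens
    return weights @ tokens                        # weighted sum -> (1, d)

rng = np.random.default_rng(0)
d = 8
speech = rng.standard_normal((10, d))    # per-frame speech features (assumed)
identity = rng.standard_normal((4, d))   # tokens from the neutral mesh (assumed)
token = rng.standard_normal((1, d))      # learnable fusion token (assumed)
fused = fusion_token_attend(token, speech, identity)
print(fused.shape)  # (1, 8)
```

In a trained model the fusion token and projections would be learned end-to-end; the sketch only shows how one token can aggregate heterogeneous cues without any explicit emotion or identity labels.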