LSF-Animation: Label-Free Speech-Driven Facial Animation via Implicit Feature Representation

📅 2025-10-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing speech-driven 3D facial animation methods rely on explicit identity/emotion labels (e.g., one-hot encodings), resulting in poor generalization and neglecting implicit emotional cues embedded in speech—thereby limiting animation naturalness and cross-speaker adaptability. To address this, we propose a label-free implicit feature representation framework: a dual-branch Transformer jointly models speech temporal dynamics and identity priors from neutral facial meshes, and a Hierarchical Interaction Fusion Block (HIFB) unifies emotional, motion, and identity cues at the token level. Our approach enables end-to-end implicit disentangled learning without any manual emotion or identity annotations. Evaluated on the 3DMEAD dataset, our method achieves state-of-the-art performance in emotional expressiveness, zero-shot speaker generalization, and visual realism of synthesized animations.

📝 Abstract
Speech-driven 3D facial animation has attracted increasing interest due to its potential to generate expressive and temporally synchronized digital humans. While recent works have begun to explore emotion-aware animation, they still depend on explicit one-hot encodings derived from given emotion and identity labels, which limits their ability to generalize to unseen speakers. Moreover, the emotional cues inherently present in speech are often neglected, limiting the naturalness and adaptability of generated animations. In this work, we propose LSF-Animation, a novel framework that eliminates the reliance on explicit emotion and identity feature representations. Specifically, LSF-Animation implicitly extracts emotion information from speech and captures identity features from a neutral facial mesh, enabling improved generalization to unseen speakers and emotional states without requiring manual labels. Furthermore, we introduce a Hierarchical Interaction Fusion Block (HIFB), which employs a fusion token to integrate features from the dual transformer branches and more effectively combine emotional, motion-related, and identity-related cues. Extensive experiments conducted on the 3DMEAD dataset demonstrate that our method surpasses recent state-of-the-art approaches in terms of emotional expressiveness, identity generalization, and animation realism. The source code will be released at: https://github.com/Dogter521/LSF-Animation.
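The paper's code has not yet been released, but the fusion-token mechanism described for the HIFB can be illustrated with a minimal single-head attention sketch. All names and shapes below are illustrative assumptions for exposition, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_with_token(speech_tokens, identity_tokens, fusion_token, d):
    """Hypothetical fusion-token attention: the learnable fusion token
    queries the concatenated outputs of the two transformer branches,
    yielding one fused representation of emotional/motion/identity cues."""
    # Concatenate tokens from both branches along the sequence axis: (T, d)
    tokens = np.concatenate([speech_tokens, identity_tokens], axis=0)
    # Scaled dot-product attention with the fusion token as query: (1, T)
    scores = (fusion_token @ tokens.T) / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    # Weighted sum over branch tokens gives the fused feature: (1, d)
    return weights @ tokens
```

A usage sketch: with 10 speech-branch tokens and 4 identity-branch tokens of dimension 16, `fuse_with_token` returns a single `(1, 16)` fused vector whose attention weights span both branches.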
Problem

Research questions and friction points this paper is trying to address.

Generates 3D facial animation from speech without identity labels
Eliminates reliance on explicit emotion encodings for generalization
Integrates emotional cues from speech for realistic animations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Implicitly extracts emotion from speech features
Captures identity from neutral facial mesh
Hierarchical fusion block integrates emotional and identity cues
Xin Lu
University of Chinese Academy of Sciences, China and Zhongguancun Academy, China
Chuanqing Zhuang
University of Chinese Academy of Sciences, China
Chenxi Jin
National University of Singapore, Singapore
Zhengda Lu
University of Chinese Academy of Sciences
Computer Graphics, Computer Vision
Yiqun Wang
Chongqing University ⇐ KAUST ⇐ Institute of Automation, CAS
Computer Graphics, Geometric Learning, Geometric Processing
Wu Liu
University of Science and Technology of China, China
Jun Xiao
University of Chinese Academy of Sciences, China and Zhongguancun Academy, China