EmoFace: Emotion-Content Disentangled Speech-Driven 3D Talking Face Animation

📅 2024-08-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing speech-driven 3D facial animation methods struggle to disentangle phonetic content from emotional semantics, leading to coupled modeling of lip motion and facial expressions and distorted emotional rendering. To address this, we propose the first emotion-content dual-stream network, integrating Mesh Attention to enhance facial topology awareness and SpiralConv3D (a spatio-temporal graph convolution) to model holistic facial dynamics. We further introduce a self-growing intermediate-supervision training paradigm to improve convergence and generalization. Our method achieves explicit, decoupled modeling of speech content and emotional semantics. Evaluated on 3D-RAVDESS and VOCASET, it sets new state-of-the-art performance, e.g., on 3D-RAVDESS a Lip Vertex Error (LVE) of 4.89×10⁻⁵ mm and an Emotion Vertex Error (EVE) of 0.95×10⁻⁵ mm. This significantly advances high-fidelity, fine-grained, emotion-controllable 3D talking face generation.
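The summary names the main architectural pieces without detail. As a rough illustration only, the snippet below sketches how an emotion-content dual-stream model can be wired in PyTorch. Every module choice here is an assumption: the GRU encoders, the feature sizes, the FLAME-style vertex count, and the use of standard cross-attention as a stand-in for the paper's Mesh Attention are not the authors' implementation.

```python
# Minimal sketch of an emotion-content dual-stream model (illustrative
# assumptions only; not the authors' released EmoFace implementation).
import torch
import torch.nn as nn

class DualStreamTalkingFace(nn.Module):
    def __init__(self, audio_dim=768, feat_dim=256, n_vertices=5023):
        super().__init__()
        # Two independent encoders so phonetic content and emotional
        # semantics are modeled in separate streams.
        self.content_enc = nn.GRU(audio_dim, feat_dim, batch_first=True)
        self.emotion_enc = nn.GRU(audio_dim, feat_dim, batch_first=True)
        # Generic cross-attention standing in for the paper's Mesh
        # Attention: content features attend to emotion features.
        self.fuse = nn.MultiheadAttention(feat_dim, num_heads=4,
                                          batch_first=True)
        # Decode fused features to per-frame vertex displacements
        # (5023 vertices is the FLAME topology used by VOCASET).
        self.decoder = nn.Linear(feat_dim, n_vertices * 3)

    def forward(self, audio_feats):              # (B, T, audio_dim)
        content, _ = self.content_enc(audio_feats)
        emotion, _ = self.emotion_enc(audio_feats)
        fused, _ = self.fuse(content, emotion, emotion)
        offsets = self.decoder(fused)            # (B, T, n_vertices * 3)
        return offsets.view(offsets.shape[0], offsets.shape[1], -1, 3)
```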

📝 Abstract
The creation of increasingly vivid 3D talking faces has become a hot topic in recent years. Currently, most speech-driven works focus on lip synchronisation but neglect to effectively capture the correlations between emotions and facial motions. To address this problem, we propose a two-stream network called EmoFace, which consists of an emotion branch and a content branch. EmoFace employs a novel Mesh Attention mechanism to analyse and fuse the emotion features and content features. In particular, a newly designed spatio-temporal graph-based convolution, SpiralConv3D, is used in Mesh Attention to learn potential temporal and spatial feature dependencies between mesh vertices. In addition, to the best of our knowledge, we are the first to introduce a self-growing training scheme with intermediate supervision to dynamically adjust the ratio of ground truth adopted in the 3D face animation task. Comprehensive quantitative and qualitative evaluations on our high-quality 3D emotional facial animation dataset, 3D-RAVDESS ($4.8863 \times 10^{-5}$ mm for LVE and $0.9509 \times 10^{-5}$ mm for EVE), together with the public dataset VOCASET ($2.8669 \times 10^{-5}$ mm for LVE and $0.4664 \times 10^{-5}$ mm for EVE), demonstrate that our approach achieves state-of-the-art performance.
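For readers unfamiliar with spiral operators, the following is a minimal sketch of what a spatio-temporal spiral convolution in the spirit of SpiralConv3D could look like. It assumes precomputed spiral neighbor indices from the mesh topology (as in SpiralNet++) and a fixed temporal window; the actual SpiralConv3D design is not specified on this page and may differ.

```python
# Hedged sketch of a spatio-temporal spiral convolution; the spiral
# ordering, window size, and single-linear-layer design are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpiralConv3D(nn.Module):
    def __init__(self, in_ch, out_ch, spiral_len, t_win=3):
        super().__init__()
        self.t_win = t_win
        # One linear map over spiral_len spatial neighbors x t_win frames.
        self.fc = nn.Linear(in_ch * spiral_len * t_win, out_ch)

    def forward(self, x, spirals):
        # x: (B, T, V, C) vertex features; spirals: (V, L) long indices.
        B, T, V, C = x.shape
        L = spirals.size(1)
        # Spatial step: collect each vertex's spiral neighborhood per frame.
        neigh = x[:, :, spirals.reshape(-1)].reshape(B, T, V, L * C)
        # Temporal step: zero-pad the time axis, then concatenate a
        # sliding window of t_win consecutive frames per position.
        pad = self.t_win // 2
        neigh = F.pad(neigh, (0, 0, 0, 0, pad, pad))
        window = torch.cat([neigh[:, t:t + T] for t in range(self.t_win)],
                           dim=-1)                 # (B, T, V, t_win*L*C)
        return self.fc(window)                     # (B, T, V, out_ch)
```

The key point is that a single linear layer sees a fixed-length spiral of spatial neighbors across several frames at once, which is one way temporal and spatial vertex dependencies can be learned jointly.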
Problem

Research questions and friction points this paper is trying to address.

3D Facial Animation
Emotional Expression
Voice Control
Innovation

Methods, ideas, or system contributions that make the work stand out.

EmoFace Network
SpiralConv3D
Adaptive Growing Training Method (see the sketch below)
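The self-growing training scheme is described only at a high level: the ratio of ground truth used during training is adjusted dynamically. Below is a hedged sketch of one common way to realize such a schedule, linear annealing of a teacher-forcing ratio in an autoregressive loop; the paper's actual adjustment rule is not given here, so treat the schedule and all names as hypothetical.

```python
# Hypothetical scheduled-sampling style schedule; not the paper's rule.
import random

def gt_ratio(epoch, total_epochs, start=1.0, end=0.0):
    """Linearly anneal the ground-truth feeding ratio across training."""
    t = min(epoch / max(total_epochs - 1, 1), 1.0)
    return start + (end - start) * t

def choose_prev_frame(gt_frame, pred_frame, epoch, total_epochs):
    """Teacher-force with probability gt_ratio, else self-feed."""
    if random.random() < gt_ratio(epoch, total_epochs):
        return gt_frame    # supervise with the ground-truth previous frame
    return pred_frame      # let the model "grow" on its own predictions
```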
👥 Authors
Yihong Lin (South China University of Technology)
Liang Peng (Huawei Cloud)
Xianjia Wu (Huawei Cloud)
Jianqiao Hu (South China University of Technology)
Xiandong Li (Huawei Cloud)
Wenxiong Kang (Professor, College of Automation Science and Engineering, South China University of Technology; research interests: biometrics, image processing, pattern recognition and computer vision)
Songju Lei (Nanjing University)
Huang Xu (Huawei Cloud)