ESGaussianFace: Emotional and Stylized Audio-Driven Facial Animation via 3D Gaussian Splatting

📅 2026-01-05
🏛️ IEEE Transactions on Visualization and Computer Graphics
📈 Citations: 1
✨ Influential: 0
🤖 AI Summary
Existing audio-driven facial animation methods struggle to efficiently generate high-quality talking-face videos that simultaneously exhibit expressive emotions, distinctive styles, and 3D consistency. To address this challenge, this work proposes the first audio-driven facial animation framework based on 3D Gaussian Splatting that enables explicit control over both emotion and style. The approach introduces an emotion-aware spatial attention mechanism guided by affective audio cues to effectively fuse prosodic and emotional features. It further designs two dedicated 3D Gaussian deformation predictors to separately model geometry deformations driven by emotion and style. A multi-stage training strategy is employed to progressively optimize lip-sync accuracy, facial expressiveness, and stylistic fidelity. Experiments demonstrate that the proposed method outperforms state-of-the-art approaches in terms of lip-sync precision, emotional expressiveness, and stylistic richness, while maintaining high computational efficiency and strong 3D consistency.
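The summary describes an emotion-aware spatial attention mechanism in which affective audio cues guide the fusion of prosodic and content features. The paper's architecture is not specified here, so the following is only a minimal sketch of the general idea: a global emotion embedding acts as the attention query over per-frame audio content features, producing an emotion-weighted fused feature. All projection weights are random stand-ins for learned parameters, and every function name is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def emotion_guided_attention(audio_feats, emotion_feat, d_k=64, seed=0):
    """Sketch of emotion-guided fusion via scaled dot-product attention.

    audio_feats:  (T, d_a) per-frame audio content features
    emotion_feat: (d_e,)   global emotion embedding (acts as the query)
    Random projections stand in for learned weights; illustrative only.
    """
    rng = np.random.default_rng(seed)
    T, d_a = audio_feats.shape
    d_e = emotion_feat.shape[0]
    W_q = rng.standard_normal((d_e, d_k)) / np.sqrt(d_e)
    W_k = rng.standard_normal((d_a, d_k)) / np.sqrt(d_a)
    W_v = rng.standard_normal((d_a, d_k)) / np.sqrt(d_a)
    q = emotion_feat @ W_q                 # (d_k,)  emotion query
    K = audio_feats @ W_k                  # (T, d_k) audio keys
    V = audio_feats @ W_v                  # (T, d_k) audio values
    attn = softmax(K @ q / np.sqrt(d_k))   # (T,) weights over audio frames
    fused = attn @ V                       # (d_k,) emotion-weighted content
    return fused, attn

audio = np.random.default_rng(1).standard_normal((10, 32))  # 10 audio frames
emotion = np.random.default_rng(2).standard_normal(16)      # emotion embedding
fused, attn = emotion_guided_attention(audio, emotion)
```

In a trained system the attention weights would emphasize the audio frames most relevant to the current emotional state; here the random weights merely show the data flow and tensor shapes.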

๐Ÿ“ Abstract
Most current audio-driven facial animation research primarily focuses on generating videos with neutral emotions. While some studies have addressed the generation of facial videos driven by emotional audio, efficiently generating high-quality talking head videos that integrate both emotional expressions and style features remains a significant challenge. In this paper, we propose ESGaussianFace, an innovative framework for emotional and stylized audio-driven facial animation. Our approach leverages 3D Gaussian Splatting to reconstruct 3D scenes and render videos, ensuring efficient generation of 3D-consistent results. We propose an emotion-audio-guided spatial attention method that effectively integrates emotion features with audio content features. Through emotion-guided attention, the model is able to reconstruct facial details across different emotional states more accurately. To achieve emotional and stylized deformations of the 3D Gaussian points from emotion and style features, we introduce two 3D Gaussian deformation predictors. Furthermore, we propose a multi-stage training strategy, enabling the step-by-step learning of the character's lip movements, emotional variations, and style features. Our generated results exhibit high efficiency, high quality, and 3D consistency. Extensive experimental results demonstrate that our method outperforms existing state-of-the-art techniques in terms of lip movement accuracy, expression variation, and style feature expressiveness.
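The abstract introduces two dedicated 3D Gaussian deformation predictors, one driven by emotion and one by style. Their architecture is not given here, so the sketch below only illustrates the common pattern for such predictors: a small MLP maps each Gaussian's position plus a driving code to per-Gaussian offsets for position, rotation, and scale. Random weights stand in for a trained network, and the one-hot emotion code and all names are hypothetical.

```python
import numpy as np

def deformation_predictor(gauss_xyz, cond, hidden=64, seed=0):
    """Tiny MLP sketch: per-Gaussian position + a driving code (emotion
    or style) -> offsets for position (3), rotation quaternion (4), and
    scale (3). Random weights stand in for learned parameters.
    """
    rng = np.random.default_rng(seed)
    N = gauss_xyz.shape[0]
    # Condition every Gaussian on the same driving code.
    x = np.concatenate([gauss_xyz, np.tile(cond, (N, 1))], axis=1)
    W1 = rng.standard_normal((x.shape[1], hidden)) * 0.1
    W2 = rng.standard_normal((hidden, 10)) * 0.1
    h = np.maximum(x @ W1, 0.0)          # ReLU hidden layer
    out = h @ W2                         # (N, 10) raw offsets
    d_xyz, d_rot, d_scale = out[:, :3], out[:, 3:7], out[:, 7:]
    return d_xyz, d_rot, d_scale

xyz = np.random.default_rng(1).standard_normal((100, 3))  # 100 Gaussian centers
emo_code = np.zeros(8); emo_code[2] = 1.0  # hypothetical one-hot emotion code
d_xyz, d_rot, d_scale = deformation_predictor(xyz, emo_code)
# With two separate predictors (emotion and style), the offsets from each
# could be applied to the canonical Gaussians; e.g. for positions:
deformed = xyz + d_xyz
```

Using two separate predictors, as the abstract describes, keeps emotion-driven and style-driven geometry changes disentangled, so each driving signal can be controlled independently at inference time.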
Problem

Research questions and friction points this paper is trying to address.

audio-driven facial animation
emotional expression
stylized animation
3D Gaussian Splatting
talking head generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D Gaussian Splatting
Emotional Facial Animation
Audio-Driven Talking Head
Stylized Animation
Spatial Attention Mechanism
Chuhang Ma
JHC & AI Institute, Shanghai Jiao Tong University, Shanghai, China
Shuai Tan
JHC & AI Institute, Shanghai Jiao Tong University, Shanghai, China
Ye Pan
Associate Professor @ SJTU, Associate Research Scientist, Disney Research
AR/VR, Avatars, Animations, Computer Graphics, Computer Human Interaction
Jiaolong Yang
Microsoft Research
3D Computer Vision
Xin Tong
Microsoft Research Asia