🤖 AI Summary
Existing audio-driven facial animation methods struggle to efficiently generate high-quality talking-face videos that simultaneously exhibit expressive emotions, distinctive styles, and 3D consistency. To address this challenge, this work proposes the first audio-driven facial animation framework based on 3D Gaussian Splatting that enables explicit control over both emotion and style. The approach introduces an emotion-aware spatial attention mechanism guided by affective audio cues to effectively fuse prosodic and emotional features. It further designs two dedicated 3D Gaussian deformation predictors to separately model geometry deformations driven by emotion and style. A multi-stage training strategy is employed to progressively optimize lip-sync accuracy, facial expressiveness, and stylistic fidelity. Experiments demonstrate that the proposed method outperforms state-of-the-art approaches in terms of lip-sync precision, emotional expressiveness, and stylistic richness, while maintaining high computational efficiency and strong 3D consistency.
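To make the emotion-aware spatial attention concrete, below is a minimal PyTorch sketch of how affective audio cues could modulate spatial content features via cross-attention. This is an illustrative assumption, not the paper's implementation: the module name `EmotionGuidedSpatialAttention`, the dimensions, and the concatenation-based fusion are all hypothetical choices.

```python
import torch
import torch.nn as nn

class EmotionGuidedSpatialAttention(nn.Module):
    """Hypothetical sketch: a pooled emotion embedding from the audio acts as
    the query in cross-attention over per-region audio-content features, and
    the attended emotion context is fused back into the content features."""

    def __init__(self, feat_dim: int = 128, emo_dim: int = 64, n_heads: int = 4):
        super().__init__()
        self.to_q = nn.Linear(emo_dim, feat_dim)       # emotion -> attention query
        self.attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        self.fuse = nn.Linear(feat_dim * 2, feat_dim)  # fuse content + emotion context

    def forward(self, content_feat: torch.Tensor, emo_feat: torch.Tensor) -> torch.Tensor:
        # content_feat: (B, N, feat_dim) audio-content features per spatial region
        # emo_feat:     (B, 1, emo_dim)  pooled emotion embedding from the audio
        q = self.to_q(emo_feat)                              # (B, 1, feat_dim)
        attended, _ = self.attn(q, content_feat, content_feat)
        attended = attended.expand_as(content_feat)          # broadcast emotion context
        return self.fuse(torch.cat([content_feat, attended], dim=-1))
```

The design point the summary emphasizes is that emotion and prosodic content are fused before deformation prediction, so downstream modules see a single emotion-conditioned feature rather than two separate streams.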
📄 Abstract
Most current audio-driven facial animation research focuses on generating videos with neutral emotions. While some studies have addressed facial video generation driven by emotional audio, efficiently generating high-quality talking-head videos that integrate both emotional expressions and style features remains a significant challenge. In this paper, we propose ESGaussianFace, an innovative framework for emotional and stylized audio-driven facial animation. Our approach leverages 3D Gaussian Splatting to reconstruct 3D scenes and render videos, ensuring efficient generation of 3D-consistent results. We propose an emotion-audio-guided spatial attention method that effectively integrates emotion features with audio content features; through emotion-guided attention, the model reconstructs facial details across different emotional states more accurately. To drive emotional and stylized deformations of the 3D Gaussian points from emotion and style features, we introduce two dedicated 3D Gaussian deformation predictors. Furthermore, we propose a multi-stage training strategy that enables step-by-step learning of the character's lip movements, emotional variations, and style features. Our generated results exhibit high efficiency, high quality, and 3D consistency. Extensive experimental results demonstrate that our method outperforms existing state-of-the-art techniques in terms of lip-movement accuracy, expression variation, and style-feature expressiveness.
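As an illustration of the two dedicated deformation predictors, the following sketch shows one plausible shape for an MLP that maps a condition vector (emotion or style embedding) plus canonical Gaussian attributes to per-Gaussian offsets. Everything here is an assumption made for clarity: the name `GaussianDeformPredictor`, the condition dimensions, and the 10-parameter Gaussian layout (position, rotation quaternion, scale) are hypothetical, not the paper's code.

```python
import torch
import torch.nn as nn

class GaussianDeformPredictor(nn.Module):
    """Hypothetical sketch: an MLP conditioned on an emotion or style embedding
    that predicts offsets for each canonical 3D Gaussian's attributes."""

    def __init__(self, cond_dim: int, gauss_dim: int = 10, hidden: int = 256):
        super().__init__()
        # Assumed per-Gaussian layout: 3 (position) + 4 (rotation quat) + 3 (scale)
        self.mlp = nn.Sequential(
            nn.Linear(cond_dim + gauss_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, gauss_dim),  # predicted per-Gaussian offsets
        )

    def forward(self, cond: torch.Tensor, gauss_params: torch.Tensor) -> torch.Tensor:
        # cond:         (B, cond_dim)  emotion or style embedding
        # gauss_params: (B, N, 10)     canonical Gaussian attributes
        c = cond.unsqueeze(1).expand(-1, gauss_params.size(1), -1)
        return self.mlp(torch.cat([c, gauss_params], dim=-1))

# Two separate predictors, one per condition, mirroring the abstract's design;
# their offsets would be added to the canonical Gaussians before rendering.
emotion_deform = GaussianDeformPredictor(cond_dim=64)  # dims are assumptions
style_deform = GaussianDeformPredictor(cond_dim=32)
```

Under the multi-stage strategy described above, one plausible schedule is to optimize lip-sync from audio content first, then unfreeze the emotion predictor, then the style predictor, so each capability is learned step by step without disrupting the previous one.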