🤖 AI Summary
Existing speech-driven 3D facial animation methods often rely on discrete emotion labels, which limits fine-grained, continuous emotional control. To address this, the work proposes EditEmoTalk, a controllable speech-driven 3D facial animation framework that constructs a continuous expression manifold from boundary-aware semantic embeddings and introduces an emotion consistency loss to keep generated expressions semantically aligned with the target emotion. The approach maintains high-fidelity lip synchronization while markedly improving controllability, expressiveness, and cross-emotion generalization.
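The summary describes continuous emotion control as movement along boundary-aware directions in an embedding space. The following minimal sketch (not the released code; all shapes, module names, and the `edit_emotion` helper are illustrative assumptions) shows how such an edit could look: shifting an emotion embedding along the unit normal of an inter-emotion decision boundary, with a scalar controlling edit strength.

```python
# Hypothetical sketch of continuous emotion editing along a learned
# inter-emotion boundary normal. Shapes and names are assumptions, not
# the paper's implementation.
import torch
import torch.nn.functional as F

def edit_emotion(emo_embed: torch.Tensor,
                 boundary_normal: torch.Tensor,
                 alpha: float) -> torch.Tensor:
    """Shift an emotion embedding along a unit boundary normal.

    emo_embed:       (B, D) embedding of the source emotion
    boundary_normal: (D,)   normal of the decision boundary between two emotions
    alpha:           signed edit strength; 0 keeps the source emotion,
                     larger |alpha| moves further across the boundary
    """
    n = F.normalize(boundary_normal, dim=-1)   # unit normal direction
    return emo_embed + alpha * n               # continuous, smooth edit

# Usage idea: sweep alpha over a range to traverse the expression manifold
# between two emotions, e.g. neutral -> happy, before decoding facial motion.
```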
📝 Abstract
Speech-driven 3D facial animation aims to generate realistic and expressive facial motions directly from audio. While recent methods achieve high-quality lip synchronization, they often rely on discrete emotion categories, limiting continuous and fine-grained emotional control. We present EditEmoTalk, a controllable speech-driven 3D facial animation framework with continuous emotion editing. The key idea is a boundary-aware semantic embedding that learns the normal directions of inter-emotion decision boundaries, enabling a continuous expression manifold for smooth emotion manipulation. Moreover, we introduce an emotional consistency loss that enforces semantic alignment between the generated motion dynamics and the target emotion embedding through a mapping network, ensuring faithful emotional expression. Extensive experiments demonstrate that EditEmoTalk achieves superior controllability, expressiveness, and generalization while maintaining accurate lip synchronization. Code and pretrained models will be released.
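The abstract mentions an emotional consistency loss in which a mapping network projects generated motion dynamics into the emotion embedding space and enforces alignment with the target emotion embedding. Below is a minimal sketch of one way such a loss could be written; the mapping network architecture, the frame-difference proxy for motion dynamics, and the cosine-based objective are assumptions for illustration, not the paper's definition.

```python
# Hypothetical emotion consistency loss: map generated motion dynamics into
# the emotion embedding space and penalize misalignment with the target
# emotion embedding. Names and design choices are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionConsistencyLoss(nn.Module):
    def __init__(self, motion_dim: int, emo_dim: int):
        super().__init__()
        # Mapping network: motion dynamics -> emotion embedding space.
        self.mapper = nn.Sequential(
            nn.Linear(motion_dim, 256), nn.ReLU(),
            nn.Linear(256, emo_dim),
        )

    def forward(self, motion: torch.Tensor, target_emo: torch.Tensor) -> torch.Tensor:
        """
        motion:     (B, T, motion_dim) generated facial motion (e.g. vertex offsets)
        target_emo: (B, emo_dim)       target emotion embedding
        """
        # Frame-to-frame differences as a simple proxy for motion dynamics.
        dynamics = motion[:, 1:] - motion[:, :-1]        # (B, T-1, motion_dim)
        pred_emo = self.mapper(dynamics.mean(dim=1))     # (B, emo_dim)
        # 1 - cosine similarity encourages semantic alignment with the target.
        return (1.0 - F.cosine_similarity(pred_emo, target_emo, dim=-1)).mean()
```

Such a term would typically be added to the lip-sync reconstruction objective with a weighting coefficient, so emotional faithfulness is encouraged without sacrificing synchronization accuracy.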