🤖 AI Summary
This work addresses the limitations of existing speech-driven facial animation methods, which predominantly rely on monolingual data and struggle to accommodate linguistic variation and individual speaking styles in multilingual settings. The authors propose a unified diffusion model architecture that implicitly encodes language information through transcript embeddings and extracts style representations from reference facial sequences, enabling personalized multilingual facial animation without predefined language or speaker labels. Notably, this approach is the first to jointly model the interaction between language and speaking style, which supports cross-lingual and cross-speaker generalization without labels. Experimental results demonstrate that the method outperforms current state-of-the-art approaches in both monolingual and multilingual scenarios, producing animations with more natural articulatory timing, habitual facial gestures, and temporal coherence.
📝 Abstract
Speech-Driven Facial Animation (SDFA) has gained significant attention due to its applications in movies, video games, and virtual reality. However, most existing models are trained on single-language data, limiting their effectiveness in real-world multilingual scenarios. In this work, we address multilingual SDFA, which is essential for realistic generation since language influences phonetics, rhythm, intonation, and facial expressions. Speaking style is shaped not only by language but also by individual differences. Existing methods typically rely on either language-specific or speaker-specific conditioning, but not both, which prevents them from modeling the interaction between language and individual style. We introduce Polyglot, a unified diffusion-based architecture for personalized multilingual SDFA. Our method uses transcript embeddings to encode language information and style embeddings extracted from reference facial sequences to capture individual speaking characteristics. Polyglot does not require predefined language or speaker labels, enabling generalization across languages and speakers through self-supervised learning. By jointly conditioning on language and style, it captures expressive traits such as rhythm, articulation, and habitual facial movements, producing temporally coherent and realistic animations. Experiments show improved performance in both monolingual and multilingual settings, providing a unified framework for modeling language and personal style in SDFA.
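To make the idea of joint conditioning concrete, the following is a minimal NumPy sketch, not the Polyglot implementation. It assumes per-frame audio features, a single transcript embedding, and a style vector pooled from a reference facial sequence, and feeds all three to a toy noise-predicting network inside a standard DDPM training step. Every name, dimension, and the MLP denoiser itself (`style_embedding`, `build_condition`, `denoiser`, the feature sizes) is a hypothetical choice for illustration; the abstract does not specify any of them.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes are illustrative assumptions, not values from the paper.
T_FRAMES, FACE_DIM = 30, 15
AUDIO_DIM, TEXT_DIM, STYLE_DIM = 32, 16, 8
HIDDEN, NUM_STEPS = 64, 100

def style_embedding(reference_faces, w_style):
    """Pool a reference facial sequence into one style vector (hypothetical encoder)."""
    return np.tanh(reference_faces.mean(axis=0) @ w_style)

def build_condition(audio_feats, text_emb, style_emb):
    """Concatenate per-frame audio with the transcript and style vectors repeated per frame."""
    n = audio_feats.shape[0]
    text = np.broadcast_to(text_emb, (n, text_emb.shape[0]))
    style = np.broadcast_to(style_emb, (n, style_emb.shape[0]))
    return np.concatenate([audio_feats, text, style], axis=1)

def add_noise(x0, t, alphas_cumprod, noise):
    """DDPM forward process: x_t = sqrt(a_t) * x0 + sqrt(1 - a_t) * eps."""
    a = alphas_cumprod[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise

def denoiser(x_t, t, cond, params):
    """Toy two-layer MLP that predicts the noise from noisy faces, timestep, and condition."""
    t_feat = np.full((x_t.shape[0], 1), t / NUM_STEPS)
    h = np.tanh(np.concatenate([x_t, cond, t_feat], axis=1) @ params["w1"])
    return h @ params["w2"]

# Random stand-ins for real data and learned weights.
faces = rng.normal(size=(T_FRAMES, FACE_DIM))        # target facial motion
reference = rng.normal(size=(T_FRAMES, FACE_DIM))    # reference clip of the same speaker
audio = rng.normal(size=(T_FRAMES, AUDIO_DIM))       # per-frame audio features
text_emb = rng.normal(size=TEXT_DIM)                 # transcript embedding
w_style = rng.normal(size=(FACE_DIM, STYLE_DIM)) * 0.1
in_dim = FACE_DIM + AUDIO_DIM + TEXT_DIM + STYLE_DIM + 1
params = {
    "w1": rng.normal(size=(in_dim, HIDDEN)) * 0.1,
    "w2": rng.normal(size=(HIDDEN, FACE_DIM)) * 0.1,
}

betas = np.linspace(1e-4, 0.02, NUM_STEPS)
alphas_cumprod = np.cumprod(1.0 - betas)

# One training step: noise the target, predict the noise, measure the error.
style_emb = style_embedding(reference, w_style)
cond = build_condition(audio, text_emb, style_emb)
t = int(rng.integers(0, NUM_STEPS))
noise = rng.normal(size=faces.shape)
x_t = add_noise(faces, t, alphas_cumprod, noise)
loss = float(np.mean((denoiser(x_t, t, cond, params) - noise) ** 2))
print(f"noise-prediction MSE at t={t}: {loss:.4f}")
```

Note that no language ID or speaker ID appears among the inputs: in this sketch, language enters only through the transcript embedding and speaker identity only through the reference sequence, which mirrors the label-free setup the abstract describes.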