A Bridge from Audio to Video: Phoneme-Viseme Alignment Allows Every Face to Speak Multiple Languages

📅 2025-10-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing speech-driven talking face synthesis (TFS) models suffer from poor lip-sync accuracy and rigid facial expressions in non-English languages, primarily due to English-centric training-data bias and insufficient cross-lingual generalization. To address this, we propose Multilingual Experts (MuEx), a novel framework featuring two key innovations: (1) a phoneme-viseme alignment mechanism (PV-Align) that establishes cross-modal correspondence between acoustic and visual units, and (2) a phoneme-guided Mixture-of-Experts (PG-MoE) architecture that leverages phonemes and visemes as universal, language-agnostic representations for zero-shot multilingual generalization. We introduce MTFB, a high-quality multilingual benchmark spanning 12 languages and 95.04 hours of synchronized audiovisual data. Extensive experiments demonstrate that MuEx achieves state-of-the-art performance across all seen languages, significantly improving lip motion precision and visual naturalness, while exhibiting strong zero-shot transfer to unseen languages.
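The summary's core routing idea can be sketched in a few lines: a gating network scores each expert from a language-agnostic phoneme embedding and combines the experts' outputs by those scores. The paper does not specify the PG-MoE internals here, so the function names, dimensions, and gating form below are illustrative assumptions, not the authors' implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def pg_moe_step(phoneme_embedding, gating_weights, experts):
    """Route one phoneme frame through a mixture of experts (sketch).

    phoneme_embedding: list[float], a language-agnostic phoneme feature.
    gating_weights:    list[list[float]], one weight row per expert
                       (hypothetical linear gating, an assumption).
    experts:           list of callables mapping the embedding to a
                       motion-feature vector of a common dimension.
    """
    # Gating score per expert: dot product with its weight row,
    # normalized with softmax so the gates sum to 1.
    scores = [sum(w * x for w, x in zip(row, phoneme_embedding))
              for row in gating_weights]
    gates = softmax(scores)

    # Output: gate-weighted sum of all expert outputs.
    outputs = [expert(phoneme_embedding) for expert in experts]
    dim = len(outputs[0])
    return [sum(g * out[d] for g, out in zip(gates, outputs))
            for d in range(dim)]
```

With two toy experts and symmetric gating, the result is simply the average of the expert outputs, which makes the routing behavior easy to check by hand.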

📝 Abstract
Speech-driven talking face synthesis (TFS) focuses on generating lifelike facial animations from audio input. Current TFS models perform well in English but poorly in non-English languages, producing incorrect mouth shapes and rigid facial expressions. This degraded performance stems from English-dominated training datasets and a lack of cross-lingual generalization ability. We therefore propose Multilingual Experts (MuEx), a novel framework featuring a Phoneme-Guided Mixture-of-Experts (PG-MoE) architecture that employs phonemes and visemes as universal intermediaries to bridge the audio and video modalities, achieving lifelike multilingual TFS. To mitigate the influence of linguistic differences and dataset bias, we extract audio and video features as phonemes and visemes respectively, the basic units of speech sounds and mouth movements. To address audiovisual synchronization issues, we introduce the Phoneme-Viseme Alignment Mechanism (PV-Align), which establishes robust cross-modal correspondences between phonemes and visemes. In addition, we build a Multilingual Talking Face Benchmark (MTFB) comprising 12 diverse languages and 95.04 hours of high-quality video for training and evaluating multilingual TFS. Extensive experiments demonstrate that MuEx achieves superior performance across all languages in MTFB and generalizes zero-shot to unseen languages without additional training.
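The abstract's premise that phonemes and visemes are language-agnostic units rests on a standard many-to-one mapping: several phonemes share one mouth shape (all bilabials close the lips, for instance). The grouping below is a common illustrative convention, not the paper's actual viseme inventory, which is not given here.

```python
# Illustrative many-to-one phoneme-to-viseme table (an assumption; the
# paper's exact inventory is not specified in this summary).
PHONEME_TO_VISEME = {
    # bilabial closure: lips pressed together
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    # labiodental: lower lip against upper teeth
    "f": "V_labiodental", "v": "V_labiodental",
    # open vowels
    "a": "V_open", "ae": "V_open",
    # rounded vowels / labiovelar glide
    "o": "V_round", "u": "V_round", "w": "V_round",
}

def phonemes_to_visemes(phonemes, collapse_repeats=True):
    """Map a phoneme sequence to its viseme sequence.

    Because the mapping is many-to-one, consecutive phonemes can land
    on the same mouth shape; collapsing repeats yields the distinct
    viseme targets an animation system would interpolate between.
    Unknown phonemes fall back to a neutral mouth shape.
    """
    visemes = [PHONEME_TO_VISEME.get(p, "V_neutral") for p in phonemes]
    if not collapse_repeats:
        return visemes
    collapsed = []
    for v in visemes:
        if not collapsed or collapsed[-1] != v:
            collapsed.append(v)
    return collapsed
```

The many-to-one collapse is what makes viseme sequences shorter and more language-independent than the phoneme sequences that produce them.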
Problem

Research questions and friction points this paper is trying to address.

Achieving multilingual talking face synthesis from audio input
Addressing cross-language mouth shape and expression inconsistencies
Establishing phoneme-viseme alignment for audiovisual synchronization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Phoneme-Guided Mixture-of-Experts architecture for multilingual synthesis
Phoneme-Viseme Alignment Mechanism ensures cross-modal synchronization
Multilingual dataset with 12 languages enables robust training
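To make the alignment contribution above concrete: one classical way to establish monotonic correspondences between a phoneme sequence and a viseme sequence is dynamic time warping over a per-pair cost. The paper's PV-Align is presumably a learned mechanism, so this is only a minimal analogy under an assumed cost function.

```python
def dtw_align(phoneme_feats, viseme_feats, cost):
    """Align two sequences with dynamic time warping (sketch).

    cost(p, v) gives the mismatch between a phoneme feature and a
    viseme feature. Returns the (phoneme_index, viseme_index) pairs on
    the minimum-cost monotonic alignment path.
    """
    n, m = len(phoneme_feats), len(viseme_feats)
    INF = float("inf")
    # D[i][j] = cheapest cost of aligning the first i phonemes
    # with the first j visemes.
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = cost(phoneme_feats[i - 1], viseme_feats[j - 1])
            D[i][j] = c + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])

    # Backtrack from (n, m) along locally cheapest predecessors.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, (i, j) = min((D[i - 1][j - 1], (i - 1, j - 1)),
                        (D[i - 1][j], (i - 1, j)),
                        (D[i][j - 1], (i, j - 1)))
    return path[::-1]
```

For identical sequences the path is the diagonal, i.e. a one-to-one frame correspondence; mismatched lengths stretch or compress one side monotonically.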