MAVFlow: Preserving Paralinguistic Elements with Conditional Flow Matching for Zero-Shot AV2AV Multilingual Translation

📅 2025-03-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Zero-shot audio-visual-to-audio-visual (AV2AV) multilingual translation suffers from the loss of paralinguistic cues—including speaker identity, facial expressions, and emotion—leading to degraded speech naturalness and lip-sync fidelity. To address this, we propose the first dual-modal conditional flow matching (CFM) framework that explicitly decouples semantic content modeling from paralinguistic feature modeling. Our method innovatively integrates x-vector-enhanced speaker embeddings with emotion-aware facial motion guidance to enable cross-lingual zero-shot transfer. By jointly conditioning on mel-spectrograms and facial dynamics in a cross-modal manner, our approach significantly improves speech naturalness (MOS = 4.12) and lip-sync accuracy (37% reduction in lip-sync error). It establishes new state-of-the-art performance on zero-shot multilingual AV2AV translation, demonstrating superior capability in preserving speaker identity, emotional expressiveness, and temporal alignment across languages.

Technology Category

Application Category

📝 Abstract
Despite recent advances in text-to-speech (TTS) models, audio-visual to audio-visual (AV2AV) translation still faces a critical challenge: maintaining speaker consistency between the original and translated vocal and facial features. To address this issue, we propose a conditional flow matching (CFM) zero-shot audio-visual renderer that utilizes strong dual guidance from both audio and visual modalities. By leveraging multi-modal guidance with CFM, our model robustly preserves speaker-specific characteristics and significantly enhances zero-shot AV2AV translation abilities. For the audio modality, we enhance the CFM process by integrating robust speaker embeddings with x-vectors, which serve to bolster speaker consistency. Additionally, we convey emotional nuances to the face rendering module. The guidance provided by both audio and visual cues remains independent of semantic or linguistic content, allowing our renderer to effectively handle zero-shot translation tasks for monolingual speakers in different languages. We empirically demonstrate that the inclusion of high-quality mel-spectrograms conditioned on facial information not only enhances the quality of the synthesized speech but also positively influences facial generation, leading to overall performance improvements.
Problem

Research questions and friction points this paper is trying to address.

Maintains speaker consistency in AV2AV translation.
Enhances zero-shot translation with multi-modal guidance.
Preserves emotional nuances and speaker-specific characteristics.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Conditional flow matching for AV2AV translation
Dual guidance from audio and visual modalities
Speaker embeddings with x-vectors enhance consistency
🔎 Similar Papers
No similar papers found.