DiFlowDubber: Discrete Flow Matching for Automated Video Dubbing via Cross-Modal Alignment and Synchronization

📅 2026-03-15
🤖 AI Summary
Existing video dubbing methods struggle to simultaneously achieve expressive speech, high acoustic quality, and precise lip-sync accuracy. To address this challenge, this work proposes DiFlowDubber, a novel approach employing a two-stage training framework that transfers knowledge from a pretrained text-to-speech (TTS) model to the video-driven dubbing task and leverages discrete flow matching to generate high-fidelity audio. The method introduces the FaPro module to extract global prosody and stylistic cues from facial expressions and incorporates a Synchronizer module to effectively align textual, visual, and acoustic modalities, thereby enhancing cross-modal synchronization. Experimental results on two widely used benchmarks demonstrate that DiFlowDubber significantly outperforms state-of-the-art methods across multiple metrics, including expressiveness, audio quality, and lip-sync precision.

📝 Abstract
Video dubbing has broad applications in filmmaking, multimedia creation, and assistive speech technology. Existing approaches either train directly on limited dubbing datasets or adapt pre-trained text-to-speech (TTS) models in a two-stage pipeline, and they often struggle to produce expressive prosody, rich acoustic characteristics, and precise synchronization. To address these issues, we propose DiFlowDubber, a novel two-stage training framework that effectively transfers knowledge from a pre-trained TTS model to video-driven dubbing, built on a discrete flow matching generative backbone. Specifically, we design a FaPro module that captures global prosody and stylistic cues from facial expressions and uses this information to guide the modeling of subsequent speech attributes. To ensure precise speech-lip synchronization, we introduce a Synchronizer module that bridges the modality gap among text, video, and speech, improving cross-modal alignment and generating speech that is temporally synchronized with lip movements. Experiments on two widely used benchmark datasets demonstrate that DiFlowDubber outperforms previous methods across multiple metrics.
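The abstract names discrete flow matching as the generative backbone but does not spell out its formulation. As background, a common way to set up discrete flow matching over token sequences (e.g. discrete acoustic codes) is a mask-based probability path: at time t, each clean token survives with probability t and is otherwise replaced by a mask symbol, and the model is trained to predict the clean tokens from the corrupted sequence. The sketch below illustrates only that corruption step; the vocabulary size, sequence length, and the `corrupt` helper are all hypothetical and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = 64    # hypothetical acoustic-token vocabulary size
MASK = VOCAB  # extra "mask" symbol acting as the source (noise) distribution
SEQ_LEN = 8

def corrupt(tokens, t, rng):
    """Sample x_t on a mask-based discrete path: each token independently
    survives with probability t, otherwise it becomes MASK. So t=0 gives
    an all-mask sequence (pure noise) and t=1 gives the clean data."""
    keep = rng.random(tokens.shape) < t
    return np.where(keep, tokens, MASK)

# toy "data": a sequence of discrete audio tokens
x1 = rng.integers(0, VOCAB, size=SEQ_LEN)

# one training example at an intermediate time step
t = 0.5
xt = corrupt(x1, t, rng)
# training target (not shown): cross-entropy between the model's
# prediction at each masked position of xt and the clean tokens x1;
# sampling then iteratively un-masks tokens from t=0 toward t=1
```

Under this setup the two endpoints behave as expected: `corrupt(x1, 0.0, rng)` is fully masked and `corrupt(x1, 1.0, rng)` returns the clean sequence, since `rng.random()` draws lie in [0, 1).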
Problem

Research questions and friction points this paper is trying to address.

video dubbing
speech-lip synchronization
prosody modeling
cross-modal alignment
expressive speech synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Discrete Flow Matching
Cross-Modal Alignment
Video Dubbing
Prosody Modeling
Lip-Speech Synchronization