🤖 AI Summary
Existing video dubbing methods struggle to simultaneously achieve expressive speech, high acoustic quality, and accurate lip synchronization. To address this challenge, this work proposes DiFlowDubber, a two-stage training framework that transfers knowledge from a pretrained text-to-speech (TTS) model to the video-driven dubbing task and leverages discrete flow matching to generate high-fidelity audio. The method introduces a FaPro module that extracts global prosody and stylistic cues from facial expressions, and a Synchronizer module that aligns the textual, visual, and acoustic modalities to improve cross-modal synchronization. Experimental results on two widely used benchmarks demonstrate that DiFlowDubber significantly outperforms state-of-the-art methods across multiple metrics, including expressiveness, audio quality, and lip-sync precision.
📝 Abstract
Video dubbing has broad applications in filmmaking, multimedia creation, and assistive speech technology. Existing approaches either train directly on limited dubbing datasets or adopt a two-stage pipeline that adapts pre-trained text-to-speech (TTS) models; both often struggle to produce expressive prosody, rich acoustic characteristics, and precise synchronization. To address these issues, we propose DiFlowDubber, which combines a novel two-stage training framework that effectively transfers knowledge from a pre-trained TTS model to video-driven dubbing with a discrete flow matching generative backbone. Specifically, we design a FaPro module that captures global prosody and stylistic cues from facial expressions and uses this information to guide the modeling of subsequent speech attributes. To ensure precise speech-lip synchronization, we introduce a Synchronizer module that bridges the modality gap among text, video, and speech, improving cross-modal alignment and generating speech that is temporally synchronized with lip movements. Experiments on two widely used benchmark datasets demonstrate that DiFlowDubber outperforms previous methods across multiple metrics.
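The generative backbone named in the abstract, discrete flow matching, trains a model to recover clean token sequences from partially corrupted ones along a probability path that interpolates between noise and data. The toy sketch below illustrates only that corruption step with a simple mask-based path; the `MASK` token id, the linear keep-probability schedule, and the function names are illustrative assumptions, not the paper's actual formulation:

```python
import numpy as np

MASK = -1  # hypothetical mask-token id standing in for "pure noise"

def corrupt(x1, t, rng):
    """Mask-based probability path: each token of the clean sequence x1
    is kept with probability t (t in [0, 1]) and replaced by MASK
    otherwise. t=0 yields all noise, t=1 yields the clean data."""
    keep = rng.random(len(x1)) < t
    return np.where(keep, x1, MASK)

rng = np.random.default_rng(0)
x1 = np.array([3, 1, 4, 1, 5, 9, 2, 6])   # a clean acoustic-token sequence
xt_early = corrupt(x1, t=0.1, rng=rng)     # near noise: mostly MASK
xt_late = corrupt(x1, t=0.95, rng=rng)     # near data: mostly clean tokens
```

In training, a model would see `(xt, t)` pairs like these and be optimized (typically with cross-entropy) to predict the original tokens `x1`; generation then iteratively unmasks from an all-`MASK` sequence.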