JUST-DUB-IT: Video Dubbing via Joint Audio-Visual Diffusion

πŸ“… 2026-01-29
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work proposes an end-to-end dubbing framework built on a joint audio-visual diffusion foundation model, addressing the challenge of simultaneously preserving speaker identity and achieving accurate lip sync in real-world video dubbing. Lightweight LoRA adapters let the model condition on the input video while jointly generating target-language speech and synchronized facial motion. The method brings multimodal diffusion priors into video dubbing: the generative model is used to synthesize its own multilingual paired training data, and a cross-lingual semi-inpainting training strategy further strengthens identity preservation and robustness. Experiments show that the approach substantially outperforms existing methods in visual fidelity, lip-sync precision, and robustness to complex, dynamic scenes.
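To make the adaptation step concrete, below is a minimal sketch of the standard LoRA mechanism the summary refers to: frozen base projections augmented with trainable low-rank updates. The rank, scaling, and the to_q/to_k/to_v/to_out module names are illustrative assumptions, not details published by the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: y = W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # foundation-model weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)   # A: project down to rank r
        self.up = nn.Linear(rank, base.out_features, bias=False)    # B: project back up
        nn.init.zeros_(self.up.weight)  # adapter starts as a no-op, preserving the prior
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))

def inject_lora(module: nn.Module, rank: int = 16) -> nn.Module:
    """Wrap attention projections (hypothetical module names) in LoRA; recurse elsewhere."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear) and any(k in name for k in ("to_q", "to_k", "to_v", "to_out")):
            setattr(module, name, LoRALinear(child, rank=rank))
        else:
            inject_lora(child, rank=rank)
    return module
```

Only the adapter parameters are trained, which is what keeps the dubbing specialization lightweight relative to the frozen foundation model.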

πŸ“ Abstract
Audio-Visual Foundation Models, which are pretrained to jointly generate sound and visual content, have recently shown an unprecedented ability to model multi-modal generation and editing, opening new opportunities for downstream tasks. Among these tasks, video dubbing could greatly benefit from such priors, yet most existing solutions still rely on complex, task-specific pipelines that struggle in real-world settings. In this work, we introduce a single-model approach that adapts a foundational audio-video diffusion model for video-to-video dubbing via a lightweight LoRA. The LoRA enables the model to condition on an input audio-video pair while jointly generating translated audio and synchronized facial motion. To train this LoRA, we leverage the generative model itself to synthesize paired multilingual videos of the same speaker. Specifically, we generate multilingual videos with language switches within a single clip, and then inpaint the face and audio in each half to match the language of the other half. By leveraging the rich generative prior of the audio-visual model, our approach preserves speaker identity and lip synchronization while remaining robust to complex motion and real-world dynamics. We demonstrate that our approach produces high-quality dubbed videos with improved visual fidelity, lip synchronization, and robustness compared to existing dubbing pipelines.
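The data-construction idea, inpainting the face and audio of each half of a language-switch clip to match the other half's language, can be illustrated by the masks such a scheme would need. This is a minimal sketch under assumed tensor shapes; the switch point, mouth bounding box, and proportional audio alignment are illustrative choices, not the paper's published recipe.

```python
import torch

def semi_inpaint_masks(num_frames: int, height: int, width: int,
                       num_audio_steps: int, switch_frame: int,
                       mouth_box: tuple[int, int, int, int],
                       inpaint_first_half: bool):
    """Build binary masks (1 = regenerate, 0 = keep as conditioning) for one training pair.

    Within the chosen half of a language-switch clip, the mouth region of every frame
    and the matching audio span are marked for regeneration, so the model re-dubs that
    half into the other half's language while the unmasked pixels anchor speaker identity.
    """
    v_mask = torch.zeros(num_frames, 1, height, width)
    a_mask = torch.zeros(num_audio_steps)

    t0, t1 = (0, switch_frame) if inpaint_first_half else (switch_frame, num_frames)
    y0, y1, x0, x1 = mouth_box
    v_mask[t0:t1, :, y0:y1, x0:x1] = 1.0

    # Map the inpainted frame span onto the audio timeline (assumes a constant rate).
    s0 = num_audio_steps * t0 // num_frames
    s1 = num_audio_steps * t1 // num_frames
    a_mask[s0:s1] = 1.0
    return v_mask, a_mask

# Example: re-dub the first half of a 64-frame clip, masking a 48x96 mouth region.
video_mask, audio_mask = semi_inpaint_masks(
    num_frames=64, height=256, width=256, num_audio_steps=1024,
    switch_frame=32, mouth_box=(160, 208, 80, 176), inpaint_first_half=True,
)
```

Running the generator under these masks in both directions yields two clips of the same speaker saying matched content in different languages, which is the paired supervision the LoRA is trained on.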
Problem

Research questions and friction points this paper is trying to address.

video dubbing
audio-visual synchronization
speaker identity preservation
lip synchronization
real-world robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

audio-visual diffusion
video dubbing
LoRA adaptation
multilingual synthesis
lip synchronization