JUST-DUB-IT: Video Dubbing via Joint Audio-Visual Diffusion

πŸ“… 2026-01-29
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work proposes an end-to-end dubbing framework built on a joint audio-visual diffusion foundation model, addressing the challenge of simultaneously preserving speaker identity and achieving accurate lip sync in real-world video dubbing. Lightweight LoRA adapters let the model condition on the input video while jointly generating target-language speech and synchronized facial motion. The method brings multimodal diffusion priors into video dubbing: the generative model is used to synthesize its own multilingual paired training data, and a cross-lingual semi-inpainting training strategy further strengthens identity preservation and robustness. Experiments show that the approach substantially outperforms existing methods in visual fidelity, lip-sync precision, and robustness to complex, dynamic scenes.
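To make the adaptation step concrete, below is a minimal sketch of the standard LoRA mechanism the summary refers to: frozen base projections augmented with trainable low-rank updates. The rank, scaling, and the to_q/to_k/to_v/to_out module names are illustrative assumptions, not details published by the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: y = W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # foundation-model weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)   # A: project down to rank r
        self.up = nn.Linear(rank, base.out_features, bias=False)    # B: project back up
        nn.init.zeros_(self.up.weight)  # adapter starts as a no-op, preserving the prior
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))

def inject_lora(module: nn.Module, rank: int = 16) -> nn.Module:
    """Wrap attention projections (hypothetical module names) in LoRA; recurse elsewhere."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear) and any(k in name for k in ("to_q", "to_k", "to_v", "to_out")):
            setattr(module, name, LoRALinear(child, rank=rank))
        else:
            inject_lora(child, rank=rank)
    return module
```

Only the adapter parameters are trained, which is what keeps the dubbing specialization lightweight relative to the frozen foundation model.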

πŸ“ Abstract
Audio-Visual Foundation Models, which are pretrained to jointly generate sound and visual content, have recently shown an unprecedented ability to model multi-modal generation and editing, opening new opportunities for downstream tasks. Among these tasks, video dubbing could greatly benefit from such priors, yet most existing solutions still rely on complex, task-specific pipelines that struggle in real-world settings. In this work, we introduce a single-model approach that adapts a foundational audio-video diffusion model for video-to-video dubbing via a lightweight LoRA. The LoRA enables the model to condition on an input audio-video pair while jointly generating translated audio and synchronized facial motion. To train this LoRA, we leverage the generative model itself to synthesize paired multilingual videos of the same speaker. Specifically, we generate multilingual videos with language switches within a single clip, and then inpaint the face and audio in each half to match the language of the other half. By leveraging the rich generative prior of the audio-visual model, our approach preserves speaker identity and lip synchronization while remaining robust to complex motion and real-world dynamics. We demonstrate that our approach produces high-quality dubbed videos with improved visual fidelity, lip synchronization, and robustness compared to existing dubbing pipelines.
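The data-construction idea, inpainting the face and audio of each half of a language-switch clip to match the other half's language, can be illustrated by the masks such a scheme would need. This is a minimal sketch under assumed tensor shapes; the switch point, mouth bounding box, and proportional audio alignment are illustrative choices, not the paper's published recipe.

```python
import torch

def semi_inpaint_masks(num_frames: int, height: int, width: int,
                       num_audio_steps: int, switch_frame: int,
                       mouth_box: tuple[int, int, int, int],
                       inpaint_first_half: bool):
    """Build binary masks (1 = regenerate, 0 = keep as conditioning) for one training pair.

    Within the chosen half of a language-switch clip, the mouth region of every frame
    and the matching audio span are marked for regeneration, so the model re-dubs that
    half into the other half's language while the unmasked pixels anchor speaker identity.
    """
    v_mask = torch.zeros(num_frames, 1, height, width)
    a_mask = torch.zeros(num_audio_steps)

    t0, t1 = (0, switch_frame) if inpaint_first_half else (switch_frame, num_frames)
    y0, y1, x0, x1 = mouth_box
    v_mask[t0:t1, :, y0:y1, x0:x1] = 1.0

    # Map the inpainted frame span onto the audio timeline (assumes a constant rate).
    s0 = num_audio_steps * t0 // num_frames
    s1 = num_audio_steps * t1 // num_frames
    a_mask[s0:s1] = 1.0
    return v_mask, a_mask

# Example: re-dub the first half of a 64-frame clip, masking a 48x96 mouth region.
video_mask, audio_mask = semi_inpaint_masks(
    num_frames=64, height=256, width=256, num_audio_steps=1024,
    switch_frame=32, mouth_box=(160, 208, 80, 176), inpaint_first_half=True,
)
```

Running the generator under these masks in both directions yields two clips of the same speaker saying matched content in different languages, which is the paired supervision the LoRA is trained on.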
Problem

Research questions and friction points this paper is trying to address.

video dubbing
audio-visual synchronization
speaker identity preservation
lip synchronization
real-world robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

audio-visual diffusion
video dubbing
LoRA adaptation
multilingual synthesis
lip synchronization