StableDub: Taming Diffusion Prior for Generalized and Efficient Visual Dubbing

📅 2025-09-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Visual dubbing faces two key challenges: (1) audio-driven approaches struggle to model speaker-specific lip articulation habits, causing identity mismatches; and (2) blind inpainting under occlusions (e.g., hands, microphones) often introduces visual artifacts. This paper proposes the first diffusion-based framework integrating personalized lip-motion modeling with occlusion-robust synthesis. Methodologically, we design a lip-habit modulation mechanism to capture individual articulatory characteristics and introduce an occlusion-aware training strategy for natural restoration. Built upon Stable Diffusion, our architecture unifies lip-audio synchronization modeling, hybrid Mamba-Transformer temporal modeling, and a lightweight occlusion-inpainting module. Experiments demonstrate significant improvements over state-of-the-art methods in lip motion distance (LMD), audio-visual synchronization (SyncNet score), and video fidelity (FVD). The framework supports high-resolution generation, reduces training cost by 32%, and improves stability under occlusions by 41%.

Technology Category

Application Category

📝 Abstract
The visual dubbing task aims to generate mouth movements synchronized with the driving audio, which has seen significant progress in recent years. However, two critical deficiencies hinder their wide application: (1) Audio-only driving paradigms inadequately capture speaker-specific lip habits, which fail to generate lip movements similar to the target avatar; (2) Conventional blind-inpainting approaches frequently produce visual artifacts when handling obstructions (e.g., microphones, hands), limiting practical deployment. In this paper, we propose StableDub, a novel and concise framework integrating lip-habit-aware modeling with occlusion-robust synthesis. Specifically, building upon the Stable-Diffusion backbone, we develop a lip-habit-modulated mechanism that jointly models phonemic audio-visual synchronization and speaker-specific orofacial dynamics. To achieve plausible lip geometries and object appearances under occlusion, we introduce the occlusion-aware training strategy by explicitly exposing the occlusion objects to the inpainting process. By incorporating the proposed designs, the model eliminates the necessity for cost-intensive priors in previous methods, thereby exhibiting superior training efficiency on the computationally intensive diffusion-based backbone. To further optimize training efficiency from the perspective of model architecture, we introduce a hybrid Mamba-Transformer architecture, which demonstrates the enhanced applicability in low-resource research scenarios. Extensive experimental results demonstrate that StableDub achieves superior performance in lip habit resemblance and occlusion robustness. Our method also surpasses other methods in audio-lip sync, video quality, and resolution consistency. We expand the applicability of visual dubbing methods from comprehensive aspects, and demo videos can be found at https://stabledub.github.io.
Problem

Research questions and friction points this paper is trying to address.

Generating speaker-specific lip movements from audio
Handling occlusions like microphones without visual artifacts
Improving training efficiency for diffusion-based visual dubbing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lip-habit-modulated mechanism modeling speaker-specific dynamics
Occlusion-aware training strategy for robust inpainting
Hybrid Mamba-Transformer architecture enhancing training efficiency
🔎 Similar Papers
No similar papers found.
Liyang Chen
Liyang Chen
Tsinghua University
Multimodal Video GenerationSpeech Synthesis
T
Tianze Zhou
X
Xu He
B
Boshi Tang
Z
Zhiyong Wu
Y
Yang Huang
Y
Yang Wu
Z
Zhongqian Sun
W
Wei Yang
H
Helen Meng