SyncAnyone: Implicit Disentanglement via Progressive Self-Correction for Lip-Syncing in the Wild

📅 2025-12-25
🤖 AI Summary
Existing AI-based video dubbing methods rely on masked training, which disrupts the spatiotemporal context of videos, leading to lip motion distortion, facial structural instability, and background inconsistency. To address this, we propose a two-stage lip-sync framework: (1) an audio-driven, diffusion-based video transformer trained for masked mouth-region inpainting; and (2) a mask-free, progressive self-correction fine-tuning stage that implicitly disentangles lip motion, identity, and background, eliminating artifacts end-to-end. Our work introduces the first mask-free self-correction paradigm, integrating audio-visual cross-modal alignment, pseudo-paired data generation, and mask-free fine-tuning. Evaluated in real-world scenarios, our method achieves state-of-the-art performance, significantly improving visual fidelity, temporal coherence, identity consistency, and background stability.
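
To make the masked-training setup the summary criticizes (and that Stage 1 still uses) concrete, here is a minimal PyTorch-style sketch of an audio-conditioned masked-mouth inpainting objective. The model interface, tensor shapes, `add_noise`, and the rectified-flow-style loss are all assumptions for exposition, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def add_noise(x, noise, t):
    """Linear (rectified-flow-style) forward process; illustrative only."""
    t = t.view(-1, 1, 1, 1, 1)
    return (1.0 - t) * x + t * noise

def stage1_loss(model, video, audio_feats, mouth_mask):
    """Audio-driven masked-mouth inpainting objective (hypothetical names).

    video:       (B, T, C, H, W) clean talking-head clip
    audio_feats: (B, T, D) per-frame audio features
    mouth_mask:  (B, T, 1, H, W), 1 inside the mouth region
    """
    corrupted = video * (1.0 - mouth_mask)         # destroy mouth-region context
    t = torch.rand(video.size(0), device=video.device)
    noise = torch.randn_like(video)
    noisy = add_noise(video, noise, t)
    # The video transformer conditions on the masked frames, the audio
    # features, and the timestep, and predicts the flow/velocity target.
    pred = model(noisy, corrupted, audio_feats, t)
    return F.mse_loss(pred, noise - video)         # rectified-flow velocity
```

Because the model only ever sees frames with the mouth region zeroed out, it must hallucinate the missing context at inference time, which is exactly the failure mode (facial and background artifacts) that Stage 2 is designed to correct.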

📝 Abstract
High-quality AI-powered video dubbing demands precise audio-lip synchronization, high-fidelity visual generation, and faithful preservation of identity and background. Most existing methods rely on a mask-based training strategy, where the mouth region is masked in talking-head videos and the model learns to synthesize lip movements from corrupted inputs and target audio. While this facilitates lip-sync accuracy, it disrupts spatiotemporal context, impairing performance on dynamic facial motions and causing instability in facial structure and background consistency. To overcome this limitation, we propose SyncAnyone, a novel two-stage learning framework that simultaneously achieves accurate motion modeling and high visual fidelity. In Stage 1, we train a diffusion-based video transformer for masked mouth inpainting, leveraging its strong spatiotemporal modeling to generate accurate, audio-driven lip movements. However, due to input corruption, minor artifacts may arise in the surrounding facial regions and the background. In Stage 2, we develop a mask-free tuning pipeline to address these mask-induced artifacts. Specifically, building on the Stage 1 model, we design a data generation pipeline that creates pseudo-paired training samples by synthesizing lip-synced videos from the source video and randomly sampled audio. We then fine-tune the Stage 2 model on this synthetic data, achieving precise lip editing and better background consistency. Extensive experiments show that our method achieves state-of-the-art results in visual quality, temporal coherence, and identity preservation in in-the-wild lip-syncing scenarios.
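
The Stage 2 pseudo-pairing idea can be sketched in a few lines: the frozen Stage 1 model dubs the source clip with randomly sampled audio, and the Stage 2 model is then tuned, mask-free, to map that synthetic clip back to the original video given the source audio. The `generate` call, the model signatures, and the plain reconstruction loss below are hypothetical stand-ins for the paper's actual sampling interface and diffusion objective.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def make_pseudo_pair(stage1_model, video, random_audio):
    """Build one pseudo-paired sample for mask-free Stage 2 tuning.

    The frozen Stage 1 model re-renders the source clip with lip motion
    driven by an unrelated audio track; its output also carries the
    mask-induced artifacts that Stage 2 must learn to remove.
    `generate` is a hypothetical sampling API.
    """
    synthetic = stage1_model.generate(video, random_audio)
    return synthetic, video            # input clip, artifact-free target

def stage2_step(stage2_model, stage1_model, video, source_audio, random_audio):
    """One mask-free fine-tuning step: re-sync the synthetic clip's lips
    to the source audio, so the ground truth is simply the original video."""
    synthetic, target = make_pseudo_pair(stage1_model, video, random_audio)
    pred = stage2_model(synthetic, source_audio)   # no mask on the input
    # MSE stands in for the diffusion objective; the pairing logic,
    # not the loss, is the point of this sketch.
    return F.mse_loss(pred, target)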
Problem

Research questions and friction points this paper is trying to address.

Achieving precise audio-lip synchronization in in-the-wild videos
Removing mask-induced artifacts in facial and background regions
Ensuring high visual fidelity and identity preservation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage diffusion framework for lip-sync
Mask-free tuning to correct facial artifacts
Pseudo-paired data generation for precise editing
👥 Authors
Xindi Zhang
Tongyi Lab, Alibaba Group
Dechao Meng
PhD candidate, Institute of Computing Technology, Chinese Academy of Sciences (deep learning, computer vision)
Steven Xiao
Tongyi Lab, Alibaba Group
Qi Wang
Tongyi Lab, Alibaba Group
Peng Zhang
Tongyi Lab, Alibaba Group
Bang Zhang
Tongyi Lab, Alibaba Group