MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting

📅 2024-10-14
🏛️ arXiv.org
📈 Citations: 7
✨ Influential: 2
📄 PDF
🤖 AI Summary
Real-time lip synchronization faces three core challenges: identity consistency, lip motion accuracy, and ultra-low latency. This paper proposes the first lip reenactment paradigm based on a VAE latent space, wherein head-pose-matched reference frame sampling focuses modeling exclusively on lip dynamics, and a multi-scale U-Net enables deep audio-visual feature fusion within the latent space. Furthermore, we quantitatively characterize the relationship between lip-sync loss and input information entropy, providing principled guidance for loss function design. Our method achieves over 30 FPS at 256×256 resolution with negligible startup latency, surpassing state-of-the-art methods in visual fidelity while matching the best-performing approaches in lip-sync accuracy; this makes it suitable for real-time applications such as live streaming.
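
As a reading aid, here is a minimal PyTorch-style sketch of the latent-space inpainting pipeline the summary describes; the `vae` and `unet` interfaces, the tensor layout, and the channel-wise latent concatenation are assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class MuseTalkSketch(nn.Module):
    """Latent-space inpainting pipeline, as described in the summary above.

    `vae` and `unet` are assumed interfaces (a pretrained VAE and a
    multi-scale U-Net conditioned on audio); signatures are hypothetical.
    """

    def __init__(self, vae: nn.Module, unet: nn.Module):
        super().__init__()
        self.vae = vae    # frozen encoder/decoder into the latent space
        self.unet = unet  # fuses visual latents with audio features

    @torch.no_grad()
    def forward(self, target_face, reference_face, audio_feat):
        # Occlude the lower half of the target frame: the network must
        # inpaint the mouth region from audio plus reference appearance.
        masked = target_face.clone()
        masked[:, :, masked.shape[2] // 2 :, :] = 0.0

        z_masked = self.vae.encode(masked)        # pose / upper-face cues
        z_ref = self.vae.encode(reference_face)   # identity / texture cues

        # Fuse the concatenated latents with audio embeddings at
        # multiple U-Net scales, then decode back to image space.
        z_out = self.unet(torch.cat([z_masked, z_ref], dim=1), audio_feat)
        return self.vae.decode(z_out)
```

Generating in the low-dimensional VAE latent space rather than in pixel space is what keeps per-frame cost low enough for the reported 30+ FPS.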

๐Ÿ“ Abstract
Achieving high resolution, identity consistency, and accurate lip-speech synchronization in face visual dubbing presents significant challenges, particularly for real-time applications such as live video streaming. We propose MuseTalk, which generates lip-sync targets in a latent space encoded by a Variational Autoencoder, enabling high-fidelity talking face video generation with efficient inference. Specifically, we project the occluded lower half of the face image, together with the face itself as a reference, into a low-dimensional latent space and use a multi-scale U-Net to fuse audio and visual features at various levels. We further propose a novel sampling strategy during training, which selects reference images whose head poses closely match the target's, allowing the model to focus on precise lip movement by filtering out redundant information. Additionally, we analyze the mechanism of lip-sync loss and reveal its relationship with the volume of input information. Extensive experiments show that MuseTalk consistently outperforms recent state-of-the-art methods in visual fidelity and achieves comparable lip-sync accuracy. As MuseTalk supports online generation of 256×256 faces at more than 30 FPS with negligible starting latency, it paves the way for real-time applications.
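
The abstract's pose-matched reference sampling can be pictured with a short, self-contained sketch; the (yaw, pitch, roll) pose representation, the k-nearest pre-filter, and the function name are hypothetical, not the paper's exact criterion.

```python
import numpy as np

def sample_pose_matched_reference(target_pose, candidate_poses, k=5, seed=None):
    """Pick a reference frame whose head pose is close to the target's.

    Hypothetical helper: poses are assumed to be (yaw, pitch, roll)
    vectors per frame; the paper's actual criterion may differ.
    """
    rng = np.random.default_rng(seed)
    target_pose = np.asarray(target_pose)
    candidate_poses = np.asarray(candidate_poses)
    # Distance between the target's head pose and each candidate frame.
    dists = np.linalg.norm(candidate_poses - target_pose, axis=1)
    # Keep the k closest candidates, then sample one at random so the
    # model still sees appearance variation rather than one fixed frame.
    nearest = np.argsort(dists)[: min(k, len(dists))]
    return int(rng.choice(nearest))
```

With a pose-matched reference, the remaining difference between reference and target is mostly lip motion, which is exactly the signal the model should learn; in practice the target frame itself would be excluded from the candidate set.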
Problem

Research questions and friction points this paper is trying to address.

Real-time video dubbing with identity consistency and accurate lip sync
Resolving the trade-off between visual fidelity and computational cost
Improving lip-sync accuracy and dental detail in real time
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage training framework for video dubbing (see the sketch after this list)
Informative Frame Sampling for temporal alignment
Dynamic Margin Sampling for spatial lip-movement selection
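
One plausible reading of the two-stage framework, sketched under assumptions: the loss terms, the SyncNet-style sync score, and the weight `w_sync` are illustrative, not values from the paper.

```python
import torch.nn.functional as F

def dubbing_loss(pred, target, sync_score=None, stage=1, w_sync=0.05):
    """Hedged sketch of a two-stage objective for video dubbing.

    Stage 1: reconstruction only, to stabilize identity and texture.
    Stage 2: add an audio-visual lip-sync penalty (e.g. SyncNet-style).
    """
    loss = F.l1_loss(pred, target)           # stage 1: pixel reconstruction
    if stage == 2 and sync_score is not None:
        loss = loss + w_sync * sync_score    # stage 2: sync supervision
    return loss
```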
🔎 Similar Papers
No similar papers found.
Yue Zhang
Lyra Lab, Tencent Music Entertainment
Minhao Liu
Lyra Lab, Tencent Music Entertainment
Zhaokang Chen
Lyra Lab, Tencent Music Entertainment
Bin Wu
Lyra Lab, Tencent Music Entertainment
Yubin Zeng
Lyra Lab, Tencent Music Entertainment
Chao Zhan
Lyra Lab, Tencent Music Entertainment
Yingjie He
Lyra Lab, Tencent Music Entertainment
Junxin Huang
Lyra Lab, Tencent Music Entertainment
Wenjiang Zhou
Lyra Lab, Tencent Music Entertainment