MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting

📅 2024-10-14
🏛️ arXiv.org
📈 Citations: 7
✨ Influential: 2
📄 PDF
🤖 AI Summary
Real-time lip synchronization faces three core challenges: identity consistency, lip motion accuracy, and ultra-low latency. This paper proposes the first lip reenactment paradigm based on a VAE latent space, wherein head-pose-matched reference frame sampling focuses modeling exclusively on lip dynamics, and a multi-scale U-Net enables deep audio-visual feature fusion within the latent space. Furthermore, we quantitatively characterize the relationship between lip-sync loss and input information entropy, providing principled guidance for loss function design. Our method achieves over 30 FPS at 256×256 resolution with negligible startup latency, surpassing state-of-the-art methods in visual fidelity while matching the best-performing approaches in lip-sync accuracy; this makes it suitable for real-time applications such as live streaming.
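
As a reading aid, here is a minimal PyTorch-style sketch of the latent-space inpainting pipeline the summary describes; the `vae` and `unet` interfaces, the tensor layout, and the channel-wise latent concatenation are assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class MuseTalkSketch(nn.Module):
    """Latent-space inpainting pipeline, as described in the summary above.

    `vae` and `unet` are assumed interfaces (a pretrained VAE and a
    multi-scale U-Net conditioned on audio); signatures are hypothetical.
    """

    def __init__(self, vae: nn.Module, unet: nn.Module):
        super().__init__()
        self.vae = vae    # frozen encoder/decoder into the latent space
        self.unet = unet  # fuses visual latents with audio features

    @torch.no_grad()
    def forward(self, target_face, reference_face, audio_feat):
        # Occlude the lower half of the target frame: the network must
        # inpaint the mouth region from audio plus reference appearance.
        masked = target_face.clone()
        masked[:, :, masked.shape[2] // 2 :, :] = 0.0

        z_masked = self.vae.encode(masked)        # pose / upper-face cues
        z_ref = self.vae.encode(reference_face)   # identity / texture cues

        # Fuse the concatenated latents with audio embeddings at
        # multiple U-Net scales, then decode back to image space.
        z_out = self.unet(torch.cat([z_masked, z_ref], dim=1), audio_feat)
        return self.vae.decode(z_out)
```

Generating in the low-dimensional VAE latent space rather than in pixel space is what keeps per-frame cost low enough for the reported 30+ FPS.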

๐Ÿ“ Abstract
Achieving high resolution, identity consistency, and accurate lip-speech synchronization in face visual dubbing presents significant challenges, particularly for real-time applications such as live video streaming. We propose MuseTalk, which generates lip-sync targets in a latent space encoded by a Variational Autoencoder, enabling high-fidelity talking face video generation with efficient inference. Specifically, we project the occluded lower half of the face image, together with the face itself as a reference, into a low-dimensional latent space and use a multi-scale U-Net to fuse audio and visual features at various levels. We further propose a novel sampling strategy during training, which selects reference images whose head poses closely match the target's, allowing the model to focus on precise lip movement by filtering out redundant information. Additionally, we analyze the mechanism of lip-sync loss and reveal its relationship with the volume of input information. Extensive experiments show that MuseTalk consistently outperforms recent state-of-the-art methods in visual fidelity and achieves comparable lip-sync accuracy. As MuseTalk supports online generation of 256×256 faces at more than 30 FPS with negligible starting latency, it paves the way for real-time applications.
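
The abstract's pose-matched reference sampling can be pictured with a short, self-contained sketch; the (yaw, pitch, roll) pose representation, the k-nearest pre-filter, and the function name are hypothetical, not the paper's exact criterion.

```python
import numpy as np

def sample_pose_matched_reference(target_pose, candidate_poses, k=5, seed=None):
    """Pick a reference frame whose head pose is close to the target's.

    Hypothetical helper: poses are assumed to be (yaw, pitch, roll)
    vectors per frame; the paper's actual criterion may differ.
    """
    rng = np.random.default_rng(seed)
    target_pose = np.asarray(target_pose)
    candidate_poses = np.asarray(candidate_poses)
    # Distance between the target's head pose and each candidate frame.
    dists = np.linalg.norm(candidate_poses - target_pose, axis=1)
    # Keep the k closest candidates, then sample one at random so the
    # model still sees appearance variation rather than one fixed frame.
    nearest = np.argsort(dists)[: min(k, len(dists))]
    return int(rng.choice(nearest))
```

With a pose-matched reference, the remaining difference between reference and target is mostly lip motion, which is exactly the signal the model should learn; in practice the target frame itself would be excluded from the candidate set.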
Problem

Research questions and friction points this paper is trying to address.

Real-time video dubbing with identity consistency and accurate lip sync
Resolving the trade-off between visual fidelity and computational cost
Improving lip-sync accuracy and dental detail in real time
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage training framework for video dubbing (see the sketch after this list)
Informative Frame Sampling for temporal alignment
Dynamic Margin Sampling for spatial lip-movement selection
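
One plausible reading of the two-stage framework, sketched under assumptions: the loss terms, the SyncNet-style sync score, and the weight `w_sync` are illustrative, not values from the paper.

```python
import torch.nn.functional as F

def dubbing_loss(pred, target, sync_score=None, stage=1, w_sync=0.05):
    """Hedged sketch of a two-stage objective for video dubbing.

    Stage 1: reconstruction only, to stabilize identity and texture.
    Stage 2: add an audio-visual lip-sync penalty (e.g. SyncNet-style).
    """
    loss = F.l1_loss(pred, target)           # stage 1: pixel reconstruction
    if stage == 2 and sync_score is not None:
        loss = loss + w_sync * sync_score    # stage 2: sync supervision
    return loss
```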
🔎 Similar Papers
No similar papers found.
Yue Zhang
Lyra Lab, Tencent Music Entertainment
Minhao Liu
Lyra Lab, Tencent Music Entertainment
Zhaokang Chen
Lyra Lab, Tencent Music Entertainment
Bin Wu
Lyra Lab, Tencent Music Entertainment
Yubin Zeng
Lyra Lab, Tencent Music Entertainment
Chao Zhan
Lyra Lab, Tencent Music Entertainment
Yingjie He
Lyra Lab, Tencent Music Entertainment
Junxin Huang
Lyra Lab, Tencent Music Entertainment
Wenjiang Zhou
Lyra Lab, Tencent Music Entertainment