DisentTalk: Cross-lingual Talking Face Generation via Semantic Disentangled Diffusion Model

📅 2025-03-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current talking-face generation methods face a fundamental trade-off between spatial controllability and temporal consistency: 3DMM-based approaches ensure temporal coherence but lack fine-grained regional control, whereas diffusion models enable spatial editing yet suffer from temporal jitter. To address this, we propose a semantic-decoupled hierarchical latent diffusion framework. We pioneer the decomposition of 3DMM parameters into three semantically orthogonal variables—lip motion (speech-driven), facial expression, and head pose—enabling independent, interpretable control. We further design a parameter-space-guided hierarchical diffusion architecture with region-aware attention mechanisms. Additionally, we introduce CHDTF, the first high-quality, high-resolution Chinese talking-face dataset. Extensive evaluations demonstrate state-of-the-art performance across lip-sync accuracy, expression naturalness, and temporal consistency. Notably, our method significantly improves Chinese talking-face generation quality under cross-lingual speech driving.
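The core idea of the summary — splitting a single 3DMM parameter vector into three semantically orthogonal factors that can be edited independently — can be sketched with disjoint orthonormal subspace bases. This is an illustrative toy only: the dimensions, the subspace split, and the random bases are assumptions (the paper learns this decomposition from data).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a 64-dim 3DMM parameter vector and three disjoint
# orthonormal bases (learned in the paper; random here for illustration)
# spanning the lip-motion, expression, and head-pose subspaces.
DIM = 64
basis, _ = np.linalg.qr(rng.standard_normal((DIM, DIM)))  # orthonormal columns
B_lip, B_expr, B_pose = basis[:, :20], basis[:, 20:44], basis[:, 44:]

def disentangle(theta):
    """Project a 3DMM parameter vector onto the three semantic subspaces."""
    return {name: B.T @ theta for name, B in
            [("lip", B_lip), ("expr", B_expr), ("pose", B_pose)]}

def recompose(codes):
    """Invert the projection; the three subspaces jointly span the full space."""
    return B_lip @ codes["lip"] + B_expr @ codes["expr"] + B_pose @ codes["pose"]

theta = rng.standard_normal(DIM)
codes = disentangle(theta)
assert np.allclose(recompose(codes), theta)  # lossless round-trip
# Editing one factor (e.g. the "pose" code) leaves the other two codes untouched,
# which is what enables the independent, interpretable control described above.
```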


📝 Abstract
Recent advances in talking face generation have significantly improved facial animation synthesis. However, existing approaches face fundamental limitations: 3DMM-based methods maintain temporal consistency but lack fine-grained regional control, while Stable Diffusion-based methods enable spatial manipulation but suffer from temporal inconsistencies. The integration of these approaches is hindered by incompatible control mechanisms and semantic entanglement of facial representations. This paper presents DisentTalk, introducing a data-driven semantic disentanglement framework that decomposes 3DMM expression parameters into meaningful subspaces for fine-grained facial control. Building upon this disentangled representation, we develop a hierarchical latent diffusion architecture that operates in 3DMM parameter space, integrating region-aware attention mechanisms to ensure both spatial precision and temporal coherence. To address the scarcity of high-quality Chinese training data, we introduce CHDTF, a Chinese high-definition talking face dataset. Extensive experiments show superior performance over existing methods across multiple metrics, including lip synchronization, expression quality, and temporal consistency. Project Page: https://kangweiiliu.github.io/DisentTalk.
Problem

Research questions and friction points this paper is trying to address.

Overcoming incompatible control mechanisms in talking face generation
Resolving semantic entanglement of facial representations in animation
Addressing scarcity of high-quality Chinese talking face datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic disentanglement framework for facial control
Hierarchical latent diffusion in 3DMM space
Region-aware attention for spatial-temporal coherence
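The "region-aware attention" contribution can be illustrated as masked self-attention in which each token attends only within its own facial region. This is a minimal sketch under assumed shapes and a hypothetical `region_ids` labeling; the paper's actual attention design is not reproduced here.

```python
import numpy as np

def region_aware_attention(queries, keys, values, region_ids):
    """Masked self-attention where each token attends only to tokens of the
    same facial region (e.g. lip / expression / pose).
    Shapes: queries, keys, values are (n, d); region_ids is (n,)."""
    n, d = queries.shape
    scores = queries @ keys.T / np.sqrt(d)            # (n, n) attention logits
    # Region mask: True where query and key belong to the same region.
    mask = region_ids[:, None] == region_ids[None, :]
    scores = np.where(mask, scores, -np.inf)          # block cross-region attention
    scores -= scores.max(axis=1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ values, weights
```

Because cross-region logits are set to negative infinity before the softmax, their weights are exactly zero, so edits to one region's tokens cannot leak into another region's output.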
👥 Authors

Kangwei Liu
Institute of Information Engineering, Chinese Academy of Sciences
Audio-driven Talking Face Generation; Facial Animation

Junwu Liu
Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100085, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100085, China

Yun Cao
Researcher, Tencent
CV; GANs

Jinlin Guo
Laboratory for Big Data and Decision, School of System Engineering, National University of Defense Technology

Xiaowei Yi
Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100085, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100085, China