LetsTalk: Latent Diffusion Transformer for Talking Video Synthesis

📅 2024-11-24
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address challenges in audio-driven talking-head video generation—including poor cross-modal temporal alignment, weak portrait identity consistency, and high computational overhead—this paper proposes a modular spatiotemporal attention diffusion Transformer. Methodologically: (1) it introduces a novel dual-path fusion mechanism—Symbiotic Fusion for preserving speaker identity stability and Direct Fusion for ensuring diverse, speech-action synchronized motion generation; (2) it jointly models image, audio, and temporal priors in the latent space under conditional guidance. Experiments demonstrate that the method achieves state-of-the-art performance across multiple benchmarks, significantly improving lip-sync accuracy and temporal coherence while maintaining strong identity fidelity and natural facial expressiveness. Moreover, it attains higher inference efficiency compared to mainstream diffusion-based approaches.
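The dual-path fusion described above can be pictured with a minimal PyTorch-style sketch: portrait tokens are merged into the video token sequence so both share one self-attention pass (deep, identity-preserving fusion), while audio is injected only through cross-attention (shallow fusion that leaves room for motion diversity). The class names, tensor shapes, and residual wiring below are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the two fusion paths (names and shapes are illustrative).
import torch
import torch.nn as nn

class SymbioticFusion(nn.Module):
    """Deep fusion: portrait tokens are concatenated with the video tokens
    along the sequence axis, so both share the same self-attention."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_tokens, portrait_tokens):
        # video_tokens: (B, N_v, D), portrait_tokens: (B, N_p, D)
        x = torch.cat([video_tokens, portrait_tokens], dim=1)
        x, _ = self.attn(x, x, x)                     # joint self-attention
        return x[:, : video_tokens.shape[1]]          # keep only the video part

class DirectFusion(nn.Module):
    """Shallow fusion: audio features condition the video tokens through
    cross-attention, preserving diversity in the generated motion."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_tokens, audio_tokens):
        # queries come from video, keys/values from audio
        out, _ = self.attn(video_tokens, audio_tokens, audio_tokens)
        return video_tokens + out                     # residual injection
```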

📝 Abstract
Portrait image animation using audio has rapidly advanced, enabling the creation of increasingly realistic and expressive animated faces. The challenges of this multimodality-guided video generation task involve fusing various modalities while ensuring consistency in timing and portrait. We further seek to produce vivid talking heads. To address these challenges, we present LetsTalk (LatEnt Diffusion TranSformer for Talking Video Synthesis), a diffusion transformer that incorporates modular temporal and spatial attention mechanisms to merge multimodality and enhance spatial-temporal consistency. To handle multimodal conditions, we first summarize three fusion schemes, ranging from shallow to deep fusion compactness, and thoroughly explore their impact and applicability. Then we propose a suitable solution according to the modality differences of image, audio, and video generation. For portrait, we utilize a deep fusion scheme (Symbiotic Fusion) to ensure portrait consistency. For audio, we implement a shallow fusion scheme (Direct Fusion) to achieve audio-animation alignment while preserving diversity. Our extensive experiments demonstrate that our approach generates temporally coherent and realistic videos with enhanced diversity and liveliness.
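The "modular temporal and spatial attention" in the abstract refers to factorizing attention over the latent video grid: tokens within a frame attend to each other spatially, and the same spatial position attends across frames. Below is a hedged sketch of such a block; the layer names, pre-norm layout, and tensor shape (B, T, S, D) are assumptions made for illustration, not the paper's code.

```python
# Illustrative factorized spatial-temporal transformer block (not the authors' code).
import torch
import torch.nn as nn

class SpatialTemporalBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):
        # x: (B, T, S, D) — batch, frames, spatial tokens per frame, channels
        B, T, S, D = x.shape

        # Spatial attention: tokens of the same frame attend to each other.
        xs = x.reshape(B * T, S, D)
        h = self.norm1(xs)
        xs = xs + self.spatial_attn(h, h, h)[0]
        x = xs.reshape(B, T, S, D)

        # Temporal attention: the same spatial location attends across frames.
        xt = x.permute(0, 2, 1, 3).reshape(B * S, T, D)
        h = self.norm2(xt)
        xt = xt + self.temporal_attn(h, h, h)[0]
        x = xt.reshape(B, S, T, D).permute(0, 2, 1, 3)

        # Position-wise feed-forward with residual.
        return x + self.mlp(self.norm3(x))
```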
Problem

Research questions and friction points this paper is trying to address.

Efficient fusion of multimodal inputs for video synthesis
Maintaining temporal and portrait consistency in long videos
Reducing computational cost while enhancing generation quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Linear Diffusion Transformer for video synthesis
Deep compression autoencoder for latent representations
Memory bank mechanism for temporal consistency
Haojie Zhang
South China University of Technology
Zhihao Liang
South China University of Technology
Computer Vision and Pattern Recognition, Machine Learning
Ruibo Fu
Associate Professor, CASIA
AIGC, LMM, Intelligent speech interaction, Deepfake detection
Zhengqi Wen
Tsinghua University
LLM
Xuefei Liu
Institute of Automation, Chinese Academy of Sciences
Chenxing Li
AI Lab, Tencent
Jianhua Tao
Department of Automation, Tsinghua University
Yaling Liang
South China University of Technology