MoDA: Multi-modal Diffusion Architecture for Talking Head Generation

📅 2025-07-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses key limitations of multimodal diffusion models for speech-driven talking head generation—namely, low inference efficiency, prominent visual artifacts, and distortions in facial expressions and head motion. We propose an efficient diffusion framework that jointly models the parameter space and employs flow matching. Methodologically, we construct a disentangled VAE latent space to separately encode identity, expression, pose, and texture; design a coarse-to-fine multimodal diffusion architecture integrating speech, text, and motion priors; and introduce cross-modal feature alignment and joint flow matching to ensure spatiotemporal consistency among expression, speech, and head motion. Experiments on LRS3 and VoxCeleb2 demonstrate substantial improvements: 21.3% reduction in FID, 18.7% reduction in LPIPS, and 3.2× faster inference, alongside enhanced motion naturalness, diversity, photorealism, and practical utility over state-of-the-art methods.
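The disentangled latent space mentioned above can be pictured as partitioning one latent vector into named factor codes that are manipulated independently. A minimal sketch, assuming made-up dimensions and factor names styled after the summary (not taken from the paper itself):

```python
import numpy as np

# Illustrative split of a VAE latent into disentangled factor codes.
# All dimensions below are assumptions for the sketch, not MoDA's values.
DIMS = {"identity": 64, "expression": 32, "pose": 6, "texture": 128}

def split_latent(z):
    """Slice a flat latent vector into named factor codes."""
    out, start = {}, 0
    for name, d in DIMS.items():
        out[name] = z[start:start + d]
        start += d
    return out

z = np.random.default_rng(0).standard_normal(sum(DIMS.values()))
factors = split_latent(z)
# With such a split, one could e.g. swap the identity code between two
# videos while keeping expression, pose, and texture codes fixed.
```

The point of the disentanglement is exactly this kind of factor-wise editing: the diffusion model only has to generate the motion-related codes, while identity and texture stay tied to the reference image.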

📝 Abstract
Talking head generation with arbitrary identities and speech audio remains a crucial problem in the realm of digital humans and the virtual metaverse. Recently, diffusion models have become a popular generative technique in this field owing to their strong generation and generalization capabilities. However, several challenges remain for diffusion-based methods: 1) inefficient inference and visual artifacts, which arise from the implicit latent space of Variational Auto-Encoders (VAE), complicating the diffusion process; 2) a lack of authentic facial expressions and head movements, resulting from insufficient multi-modal information interaction. In this paper, MoDA handles these challenges by 1) defining a joint parameter space that bridges motion generation and neural rendering, and leveraging flow matching to simplify the diffusion learning process; 2) introducing a multi-modal diffusion architecture to model the interaction among noisy motion, audio, and auxiliary conditions, ultimately enhancing overall facial expressiveness. Subsequently, a coarse-to-fine fusion strategy is adopted to progressively integrate different modalities, ensuring effective integration across feature spaces. Experimental results demonstrate that MoDA significantly improves video diversity, realism, and efficiency, making it suitable for real-world applications.
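For context on the flow-matching objective the abstract refers to: a minimal sketch of the standard straight-line (rectified-flow) training target, not MoDA's actual implementation. The batch shapes and the zero-output predictor are placeholders chosen only to make the regression loss concrete:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: x0 is Gaussian noise, x1 a batch of "motion latents".
x0 = rng.standard_normal((4, 8))
x1 = rng.standard_normal((4, 8))

# Flow matching uses a straight-line probability path from noise to data:
#   x_t = (1 - t) * x0 + t * x1, with constant target velocity u_t = x1 - x0.
t = rng.uniform(size=(4, 1))
x_t = (1.0 - t) * x0 + t * x1
u_t = x1 - x0

# A network v_theta(x_t, t, conditions) would regress u_t; a zero
# predictor stands in here so the loss below is computable.
def v_theta(x, t):
    return np.zeros_like(x)  # placeholder for the learned vector field

loss = float(np.mean((v_theta(x_t, t) - u_t) ** 2))
```

Because the target velocity is constant along each path, this regression is simpler than denoising score matching, which is the sense in which flow matching "simplifies the diffusion learning process."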
Problem

Research questions and friction points this paper is trying to address.

Inefficient inference and visual artifacts in diffusion-based talking head generation
Insufficient multi-modal interaction causing unnatural facial expressions
Challenges in integrating motion, audio, and conditions for realistic output
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint parameter space bridges motion and rendering
Flow matching simplifies diffusion learning process
Multi-modal diffusion enhances facial expressiveness
Xinyang Li
Xunguang Team, DAMO Academy, Alibaba Group; Zhejiang University

Gen Li
Zhejiang University

Zhihui Lin
Tsinghua University, China
Machine Learning, Deep Learning, Video Generation, Segmentation

Yichen Qian
Alibaba DAMO Academy
Computer Vision, Face and Gesture, Generative Adversarial Networks

Gongxin Yao
Zhejiang University

Weinan Jia
Xunguang Team, DAMO Academy, Alibaba Group

Weihua Chen
Alibaba DAMO Academy, previously NLPR, CASIA
Computer Vision

Fan Wang
Xunguang Team, DAMO Academy, Alibaba Group; Hupan Lab