GMTalker: Gaussian Mixture-based Audio-Driven Emotional Talking Video Portraits

📅 2023-12-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses three key challenges in speech-driven talking-head video generation: imprecise emotional control, unnatural emotion transitions, and insufficient motion diversity. To this end, we propose a high-fidelity, fine-grained emotionally controllable synthesis framework. Methodologically: (1) We introduce a novel continuous disentangled latent space generator based on Gaussian Mixture Models (GMMs), enabling clean separation of identity, facial expression, and head pose; (2) We design a large-motion generator leveraging normalizing flows to enhance naturalness and diversity of head movements and blinking; (3) We construct a learnable emotion mapping network supporting personalized emotion conditioning and cross-emotion smooth interpolation. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches in image quality, visual realism, lip-sync accuracy, emotion classification accuracy, and motion diversity. It enables fine-grained emotional editing and dynamically continuous emotional transitions.
📝 Abstract
Synthesizing high-fidelity and emotion-controllable talking video portraits, with audio-lip sync, vivid expressions, realistic head poses, and eye blinks, has been an important and challenging task in recent years. Most existing methods suffer in achieving personalized and precise emotion control, smooth transitions between different emotion states, and the generation of diverse motions. To tackle these challenges, we present GMTalker, a Gaussian mixture-based emotional talking portraits generation framework. Specifically, we propose a Gaussian mixture-based expression generator that can construct a continuous and disentangled latent space, achieving more flexible emotion manipulation. Furthermore, we introduce a normalizing flow-based motion generator pretrained on a large dataset with a wide-range motion to generate diverse head poses, blinks, and eyeball movements. Finally, we propose a personalized emotion-guided head generator with an emotion mapping network that can synthesize high-fidelity and faithful emotional video portraits. Both quantitative and qualitative experiments demonstrate our method outperforms previous methods in image quality, photo-realism, emotion accuracy, and motion diversity.
Problem

Research questions and friction points this paper is trying to address.

Achieving personalized and precise emotion control in talking videos
Ensuring smooth transitions between different emotional states
Generating diverse and realistic head and eye movements
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gaussian mixture-based expression generator for flexible emotion control
Normalizing flow-based motion generator for diverse movements
Personalized emotion-guided head generator with mapping network
🔎 Similar Papers
No similar papers found.
Y
Yibo Xia
School of Astronautics, Beihang University, Beijing, 100191, P.R. China
L
Lizhen Wang
Department of Automation, Tsinghua University, Beijing 100084, P.R.China
Xiang Deng
Xiang Deng
Scale AI
Machine LearningNLPKnowledge GraphsSemantic Web
Xiaoyan Luo
Xiaoyan Luo
Beihang university
computer vision
Yebin Liu
Yebin Liu
Professor, Tsinghua University
Computer GraphicsComputational Photography3D VisionDigital Humans