Monocular and Generalizable Gaussian Talking Head Animation

📅 2025-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the lack of geometric and appearance information and poor generalizability in monocular video-driven 3D Gaussian talking-head animation, this paper proposes the first generic framework that requires neither multi-view supervision nor identity-specific fine-tuning. Methodologically, we introduce a novel depth-guided geometric symmetry augmentation mechanism and design a two-stage symmetric prior parameter prediction module, integrating monocular depth estimation, symmetric point cloud transformation, filtering-based optimization, and 3D Gaussian parameter generation. Our framework learns robust cross-identity 3D facial dynamics solely from monocular training videos, significantly improving facial structural integrity, temporal coherence, and cross-subject generalization. Extensive evaluations on multiple benchmarks demonstrate consistent and substantial improvements over state-of-the-art methods. This work establishes a new paradigm for high-fidelity talking-head synthesis under low-resource conditions.
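The depth-guided symmetric augmentation described above can be pictured with a short sketch. This is a minimal illustration under assumptions, not the paper's implementation: the known pinhole intrinsics `K`, the alignment that puts the facial symmetry plane at x = 0, and the nearest-neighbour gap threshold used for filtering are all hypothetical choices.

```python
import numpy as np
from scipy.spatial import cKDTree

def backproject_depth(depth, K):
    """Lift a per-pixel depth map to a point cloud in the camera frame.

    depth: (H, W) metric depth from a monocular estimator (assumed given).
    K:     (3, 3) pinhole intrinsics (assumed known).
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

def symmetric_augment(points, max_gap=0.01):
    """Mirror the observed half of the face and filter implausible points.

    Assumes the cloud has been aligned so the facial symmetry plane is x = 0.
    Mirrored points far from every observed point are dropped, standing in
    for the filtering-based optimization mentioned in the summary.
    """
    mirrored = points * np.array([-1.0, 1.0, 1.0])
    dist, _ = cKDTree(points).query(mirrored)  # gap to nearest observed point
    return np.concatenate([points, mirrored[dist < max_gap]], axis=0)
```

The augmented cloud would then serve as the complete set of 3D Gaussian position parameters that the rest of the pipeline decorates with rotation, scale, opacity, and appearance.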

📝 Abstract
In this work, we introduce Monocular and Generalizable Gaussian Talking Head Animation (MGGTalk), which requires only monocular datasets and generalizes to unseen identities without personalized re-training. Compared with previous 3D Gaussian Splatting (3DGS) methods that require elusive multi-view datasets or tedious personalized learning/inference, MGGTalk enables more practical and broader applications. However, in the absence of multi-view and personalized training data, the incompleteness of geometric and appearance information poses a significant challenge. To address this challenge, MGGTalk exploits depth information and facial symmetry characteristics to supplement both geometric and appearance features. Initially, based on the pixel-wise geometric information obtained from depth estimation, we incorporate symmetry operations and point cloud filtering techniques to ensure complete and precise position parameters for 3DGS. Subsequently, we adopt a two-stage strategy with symmetric priors for predicting the remaining 3DGS parameters. We begin by predicting Gaussian parameters for the visible facial regions of the source image. These parameters are then utilized to improve the prediction of Gaussian parameters for the non-visible regions. Extensive experiments demonstrate that MGGTalk surpasses previous state-of-the-art methods, achieving superior performance across various metrics.
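A minimal sketch of how the two-stage prediction with symmetric priors could be wired up, assuming per-point features have already been extracted from the source image. The module names, feature dimensions, and the way visible-point features are routed to their mirrored counterparts are illustrative assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class GaussianParamHead(nn.Module):
    """Predict per-point 3DGS parameters from point-wise features."""

    def __init__(self, feat_dim):
        super().__init__()
        # 4 (rotation quaternion) + 3 (scale) + 1 (opacity) + 3 (colour)
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 4 + 3 + 1 + 3),
        )

    def forward(self, feats):  # feats: (N, feat_dim)
        rot, scale, opacity, colour = self.mlp(feats).split([4, 3, 1, 3], dim=-1)
        return rot, scale, opacity, colour


class TwoStagePredictor(nn.Module):
    """Stage 1: parameters for visible points.
    Stage 2: parameters for non-visible (mirrored) points, conditioned on the
    visible-point features as a symmetric prior (hypothetical wiring)."""

    def __init__(self, feat_dim):
        super().__init__()
        self.visible_head = GaussianParamHead(feat_dim)
        self.occluded_head = GaussianParamHead(feat_dim * 2)

    def forward(self, vis_feats, occ_feats, mirror_idx):
        # Stage 1: visible facial regions of the source image.
        vis_params = self.visible_head(vis_feats)
        # Stage 2: each non-visible point borrows the features of its
        # mirrored visible counterpart (mirror_idx maps occluded -> visible).
        prior = vis_feats[mirror_idx]
        occ_params = self.occluded_head(torch.cat([occ_feats, prior], dim=-1))
        return vis_params, occ_params
```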
Problem

Research questions and friction points this paper is trying to address.

Enables monocular talking head animation without multi-view data
Generalizes to unseen identities without re-training
Enhances geometric and appearance features using depth information
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses depth information for geometric enhancement
Applies symmetry operations for parameter precision
Two-stage strategy predicts visible and non-visible regions
Authors

Shengjie Gong
South China University of Technology

Haojie Li
South China University of Technology

Jiapeng Tang
Technical University of Munich
3D Reconstruction, Computer Vision, Generative Models

Dongming Hu
South China University of Technology

Shuangping Huang
Professor, Electronic and Information Engineering, South China University of Technology
Computer Vision, AIGC, LLM, Embodied AI

Hao Chen
South China University of Technology

Tianshui Chen
Guangdong University of Technology

Zhuoman Liu
PhD Candidate, The Hong Kong Polytechnic University
3D Vision, Computer Graphics, Generative Models, Neural Simulation