🤖 AI Summary
Current audio-driven 3D talking avatar methods suffer from a fundamental trade-off between speed and stability: high-fidelity approaches (e.g., diffusion models) incur prohibitive inference latency, while real-time methods (e.g., Gaussian splatting) exhibit temporal jitter and video artifacts due to inaccurate facial tracking and inconsistent Gaussian mapping. To address this, we propose the first audio-driven Gaussian splatting framework explicitly constrained by 3D Morphable Models (3DMMs). Our method employs an audio-conditional Transformer to predict 3DMM parameters, enabling differentiable Gaussian rendering and joint monocular video–audio modeling. This design explicitly enforces identity consistency and temporal stability. Experiments demonstrate that our approach achieves real-time inference (≥30 FPS) while significantly suppressing jitter and flickering, achieving state-of-the-art results in LPIPS, FID, and human perceptual evaluation.
📝 Abstract
Speech-driven talking heads have recently emerged and enable interactive avatars. However, real-world applications remain limited, as current methods are either high-fidelity but slow, or fast yet temporally unstable. Diffusion-based methods produce realistic images, yet struggle in one-shot settings. Gaussian Splatting approaches run in real time, yet inaccurate facial tracking and inconsistent Gaussian mappings lead to unstable outputs and video artifacts that are detrimental to realistic use cases. We address this problem by constraining Gaussian Splatting with 3D Morphable Models to generate person-specific avatars. We introduce transformer-based prediction of model parameters directly from audio to enforce temporal consistency. From monocular video and independent speech audio inputs, our method generates real-time talking-head videos, and we report competitive quantitative and qualitative performance.
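The core mapping described above, audio features in, per-frame 3DMM parameters out via a Transformer-style temporal model, can be sketched in outline. The snippet below is a minimal single-head self-attention layer in NumPy with hypothetical dimensions (80-d audio features, 64-d hidden size, 62 3DMM coefficients); it illustrates the data flow only and is not the paper's actual architecture, which the abstract does not specify in detail.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over T frames.

    X: (T, d) sequence of per-frame features; attention lets each
    output frame depend on the whole audio window, which is what
    drives temporal consistency in the predicted parameters.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return A @ V

# Hypothetical sizes: 25 audio frames, 80-d features (e.g. mel bands),
# 64-d model width, 62 3DMM expression/pose coefficients per frame.
T, d_audio, d_model, n_3dmm = 25, 80, 64, 62
X = rng.standard_normal((T, d_audio))  # stand-in for extracted audio features

W_in  = rng.standard_normal((d_audio, d_model)) * 0.1
Wq    = rng.standard_normal((d_model, d_model)) * 0.1
Wk    = rng.standard_normal((d_model, d_model)) * 0.1
Wv    = rng.standard_normal((d_model, d_model)) * 0.1
W_out = rng.standard_normal((d_model, n_3dmm)) * 0.1

H = self_attention(X @ W_in, Wq, Wk, Wv)  # temporal mixing across frames
params = H @ W_out                        # one 3DMM parameter vector per frame
print(params.shape)                       # (25, 62)
```

In the full system these per-frame 3DMM parameters would condition a differentiable Gaussian-splatting renderer, so gradients from the rendered frames flow back through the parameter predictor.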