Audio-Driven Universal Gaussian Head Avatars

📅 2025-09-23
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses two key challenges in audio-driven, high-fidelity talking-avatar synthesis: insufficient identity preservation and the difficulty of modeling audio-dependent facial dynamics such as eyebrow motion, gaze shifts, and intra-oral articulation. To this end, we propose the Universal Head Avatar Prior (UHAP), the first framework to jointly model audio-driven facial geometry and appearance implicitly across subjects. UHAP strengthens identity fidelity via supervision from neutral 3D scans, and integrates multi-view video training, an audio-to-latent mapping network, and a monocular video encoder to support high-fidelity personalized rendering and efficient fine-tuning. Quantitative and perceptual evaluations demonstrate that UHAP surpasses state-of-the-art methods in lip-sync accuracy, image quality, and visual realism. Notably, it is the first general-purpose audio-driven framework capable of faithfully reconstructing subtle, fine-grained expression dynamics.

๐Ÿ“ Abstract
We introduce the first method for audio-driven universal photorealistic avatar synthesis, combining a person-agnostic speech model with our novel Universal Head Avatar Prior (UHAP). UHAP is trained on cross-identity multi-view videos. In particular, UHAP is supervised with neutral scan data, enabling it to capture identity-specific details at high fidelity. In contrast to previous approaches, which predominantly map audio features only to geometric deformations while ignoring audio-dependent appearance variations, our universal speech model directly maps raw audio inputs into the UHAP latent expression space. This expression space inherently encodes both geometric and appearance variations. For efficient personalization to new subjects, we employ a monocular encoder, which enables lightweight regression of dynamic expression variations across video frames. By accounting for these expression-dependent changes, it allows the subsequent model fine-tuning stage to focus exclusively on capturing the subject's global appearance and geometry. Decoding these audio-driven expression codes via UHAP generates highly realistic avatars with precise lip synchronization and nuanced expressive details, such as eyebrow movement, gaze shifts, and realistic mouth-interior appearance and motion. Extensive evaluations demonstrate that our method is not only the first generalizable audio-driven avatar model to account for detailed appearance modeling and rendering, but that it also outperforms competing (geometry-only) methods across metrics measuring lip-sync accuracy, quantitative image quality, and perceptual realism.
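
To make the pipeline concrete, below is a minimal, illustrative sketch (not the authors' code) of the inference path the abstract describes: a person-agnostic speech model maps audio features to codes in the UHAP latent expression space, and a decoder turns an expression code plus an identity code into per-Gaussian geometry and appearance. All dimensions, layer choices, and names (AudioToExpression, UHAPDecoder, N_GAUSSIANS) are assumptions for illustration.

```python
import torch
import torch.nn as nn

N_GAUSSIANS = 5_000  # assumed Gaussian count; real avatars typically use far more

class AudioToExpression(nn.Module):
    """Person-agnostic speech model: audio features -> UHAP expression codes."""
    def __init__(self, audio_dim=768, latent_dim=256):
        super().__init__()
        self.temporal = nn.GRU(audio_dim, 512, batch_first=True)  # temporal context
        self.head = nn.Linear(512, latent_dim)

    def forward(self, audio):            # audio: (B, T, audio_dim)
        h, _ = self.temporal(audio)
        return self.head(h)              # (B, T, latent_dim) expression codes

class UHAPDecoder(nn.Module):
    """Decodes expression + identity codes into per-Gaussian geometry offsets
    AND appearance (RGB + opacity), mirroring the paper's claim that the
    latent expression space encodes both kinds of variation."""
    def __init__(self, latent_dim=256, id_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(latent_dim + id_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
        )
        self.geometry = nn.Linear(1024, N_GAUSSIANS * 3)    # xyz offsets
        self.appearance = nn.Linear(1024, N_GAUSSIANS * 4)  # rgb + opacity

    def forward(self, expr, identity):   # expr: (B, latent_dim), identity: (B, id_dim)
        z = self.backbone(torch.cat([expr, identity], dim=-1))
        geom = self.geometry(z).view(-1, N_GAUSSIANS, 3)
        appr = self.appearance(z).view(-1, N_GAUSSIANS, 4)
        return geom, appr

# Usage: drive one frame of a hypothetical avatar from 30 audio frames.
speech, decoder = AudioToExpression(), UHAPDecoder()
expr = speech(torch.randn(1, 30, 768))                  # (1, 30, 256)
geom, appr = decoder(expr[:, -1], torch.randn(1, 128))  # decode the last frame
```

The point mirrored from the paper is that a single latent code feeds both the geometry head and the appearance head, rather than driving geometric deformations alone.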
Problem

Research questions and friction points this paper is trying to address.

Synthesizing photorealistic avatars from audio input across arbitrary identities
Capturing both geometric and appearance variations from speech
Enabling efficient personalization for new subjects with monocular video
Innovation

Methods, ideas, or system contributions that make the work stand out.

Universal Head Avatar Prior trained on cross-identity videos
Audio inputs mapped to latent space encoding geometry and appearance
Monocular encoder enables efficient personalization to new subjects (see the sketch below)
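
As a hedged sketch of that personalization stage (assumed names, backbone, and training details; it reuses the UHAPDecoder stand-in from the earlier sketch), a monocular encoder regresses a per-frame expression code from each video frame, so fine-tuning only needs to recover the subject's global identity code:

```python
import torch
import torch.nn as nn
import torchvision.models as tvm

class MonocularEncoder(nn.Module):
    """Regresses a UHAP expression code from a single monocular video frame."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.backbone = tvm.resnet18(weights=None)  # illustrative image backbone
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, latent_dim)

    def forward(self, frames):           # frames: (B, 3, H, W)
        return self.backbone(frames)     # (B, latent_dim)

def personalize(encoder, decoder, frames, render_loss, steps=1000):
    """Fit only a per-subject identity code: the frozen encoder explains the
    per-frame expression changes, leaving optimization to absorb the global
    appearance and geometry, as the abstract describes."""
    for p in decoder.parameters():
        p.requires_grad_(False)                        # prior stays fixed here
    identity = nn.Parameter(torch.zeros(1, 128))       # per-subject identity code
    opt = torch.optim.Adam([identity], lr=1e-3)
    encoder.eval()
    for _ in range(steps):
        with torch.no_grad():
            expr = encoder(frames)                     # frozen expression regression
        geom, appr = decoder(expr, identity.expand(frames.size(0), -1))
        loss = render_loss(geom, appr)                 # e.g., photometric loss vs. frames
        opt.zero_grad()
        loss.backward()
        opt.step()
    return identity
```

Freezing the expression path while optimizing only the identity parameters reflects the design choice stated in the abstract: expression-dependent changes are explained by the encoder, so fine-tuning can focus exclusively on the subject's global appearance and geometry.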