3DXTalker: Unifying Identity, Lip Sync, Emotion, and Spatial Dynamics in Expressive 3D Talking Avatars

📅 2026-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods for 3D talking head generation struggle to simultaneously preserve speaker identity, achieve accurate lip synchronization, convey expressive emotions, and produce natural head dynamics, primarily due to data scarcity, limited audio representations, and insufficient controllability. This work proposes a unified framework that jointly models identity, lip motion, emotion, and head pose for the first time. It alleviates identity data scarcity through curated 2D-to-3D data augmentation, enriches audio representation with frame-level amplitude and emotion-aware features, and introduces a flow-matching-based Transformer to drive both facial and head dynamics. The framework further supports stylized control via prompt conditioning. Experiments demonstrate that the proposed method significantly outperforms state-of-the-art approaches in identity fidelity, lip-sync accuracy, emotional expressiveness, and motion naturalness.

📝 Abstract
Audio-driven 3D talking avatar generation is increasingly important in virtual communication, digital humans, and interactive media, where avatars must preserve identity, synchronize lip motion with speech, express emotion, and exhibit lifelike spatial dynamics, collectively defining a broader objective of expressivity. However, achieving this remains challenging due to insufficient training data with limited subject identities, narrow audio representations, and restricted explicit controllability. In this paper, we propose 3DXTalker, an expressive 3D talking avatar framework built on data-curated identity modeling, audio-rich representations, and controllable spatial dynamics. 3DXTalker enables scalable identity modeling via a 2D-to-3D data curation pipeline and disentangled representations, alleviating data scarcity and improving identity generalization. We then introduce frame-wise amplitude and emotional cues beyond standard speech embeddings, ensuring superior lip synchronization and nuanced expression modulation. These cues are unified by a flow-matching-based transformer for coherent facial dynamics. Moreover, 3DXTalker enables natural head-pose motion generation while supporting stylized control via prompt-based conditioning. Extensive experiments show that 3DXTalker integrates lip synchronization, emotional expression, and head-pose dynamics within a unified framework and achieves superior performance in 3D talking avatar generation.
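The abstract's "flow-matching-based transformer" refers to training a network to predict the velocity field of a probability path between noise and motion data. The paper does not give implementation details, so the following is a minimal sketch of the standard conditional flow-matching objective on a linear interpolation path; all names and the toy oracle model are illustrative assumptions, not the authors' code.

```python
import numpy as np

def flow_matching_loss(model, x0, x1, t):
    """Conditional flow-matching loss on a straight-line path.

    x0: samples from a Gaussian prior,
    x1: data samples (e.g. per-frame facial motion coefficients),
    t:  timesteps in [0, 1], shape (batch, 1).
    """
    x_t = (1.0 - t) * x0 + t * x1   # point on the linear path at time t
    v_target = x1 - x0              # velocity of the linear path (constant)
    v_pred = model(x_t, t)          # network predicts the velocity field
    return np.mean((v_pred - v_target) ** 2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=(8, 4))        # stand-in for motion targets
x0 = rng.normal(size=(8, 4))        # prior samples
t = rng.uniform(size=(8, 1))

# A hypothetical oracle that knows x0 and x1 attains zero loss,
# confirming the objective is minimized by the true path velocity.
oracle = lambda x_t, t: x1 - x0
loss = flow_matching_loss(oracle, x0, x1, t)
print(loss)  # → 0.0
```

At inference, samples are drawn by integrating the learned velocity field from t = 0 to t = 1 with an ODE solver; in this setting the integrated trajectory would yield the facial motion sequence conditioned on the audio features.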
Problem

Research questions and friction points this paper is trying to address.

3D talking avatar
lip sync
emotion expression
spatial dynamics
identity preservation
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D talking avatar
identity generalization
audio-rich representation
flow-matching transformer
spatial dynamics control