Abstract
Audio-driven facial animation presents an effective solution for animating digital avatars. In this paper, we detail the technical aspects of NVIDIA Audio2Face-3D, including data acquisition, network architecture, retargeting methodology, evaluation metrics, and use cases. The Audio2Face-3D system enables real-time interaction between human users and interactive avatars and facilitates facial animation authoring for game characters. To assist digital avatar creators and game developers in generating realistic facial animations, we have open-sourced the Audio2Face-3D networks, SDK, training framework, and an example dataset.
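For readers new to the area, the sketch below illustrates the general shape of an audio-driven facial-animation inference loop: audio is sliced into per-frame windows, each window is encoded into features, and a network regresses animation parameters (here, blendshape weights) that a renderer or game engine consumes. This is a minimal toy sketch, not the Audio2Face-3D network or SDK API; the model `ToyAudio2Face`, the window sizes, and the 52-coefficient output are all illustrative assumptions.

```python
# Illustrative sketch of an audio-to-blendshape inference loop.
# NOTE: this is NOT the Audio2Face-3D architecture or SDK API; all
# names, window sizes, and the 52-coefficient output are assumptions.
import torch
import torch.nn as nn

SAMPLE_RATE = 16_000          # assumed input audio rate (Hz)
WINDOW = SAMPLE_RATE // 30    # one window per animation frame at 30 fps
N_BLENDSHAPES = 52            # e.g. an ARKit-style blendshape rig (assumption)

class ToyAudio2Face(nn.Module):
    """Toy stand-in: encodes one raw-audio window into blendshape weights."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=9, stride=4), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=9, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                   # -> (B, 128, 1)
        )
        self.head = nn.Linear(128, N_BLENDSHAPES)

    def forward(self, audio_window: torch.Tensor) -> torch.Tensor:
        # audio_window: (B, 1, WINDOW) raw samples in [-1, 1]
        feats = self.encoder(audio_window).squeeze(-1)  # (B, 128)
        return torch.sigmoid(self.head(feats))          # weights in [0, 1]

model = ToyAudio2Face().eval()
audio = torch.randn(1, 1, SAMPLE_RATE)  # 1 s of placeholder audio

with torch.no_grad():
    # Slide over the audio one animation frame at a time.
    for t in range(0, audio.shape[-1] - WINDOW + 1, WINDOW):
        weights = model(audio[..., t : t + WINDOW])  # (1, N_BLENDSHAPES)
        # A real pipeline would hand `weights` to the rig/renderer here.
        print(f"frame {t // WINDOW:02d}: first coefficient = {weights[0, 0]:.3f}")
```

A production system such as Audio2Face-3D additionally models temporal context across windows and retargets the regressed motion onto character-specific rigs, as discussed later in the paper; the per-window loop above only shows where audio enters and animation parameters exit.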