Audio-Driven Talking Face Generation with Blink Embedding and Hash Grid Landmarks Encoding

📅 2024-12-02
🏛️ 2024 IEEE Smart World Congress (SWC)
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing audio-driven talking face generation methods, which often model fine-grained mouth motion with insufficient accuracy and efficiency. To overcome this, the authors propose a novel landmark representation that integrates blink embeddings with hash grid encoding, coupled with a Dynamic Landmark Transformer. This architecture injects audio features as residual terms into a dynamic neural radiance field (Dynamic NeRF), enabling high-fidelity facial animation with strong audio-visual synchronization. The approach enhances the naturalness and expressiveness of both lip movements and overall facial expressions, and experimental results show that it outperforms current state-of-the-art methods in generation quality and detail fidelity.

📝 Abstract
Dynamic Neural Radiance Fields (NeRF) have demonstrated impressive success in generating high-fidelity 3D models of talking portraits. Despite progress in rendering speed and generation quality, there are still challenges in accurately and efficiently capturing mouth movements in talking portraits. To tackle this challenge, we propose an automatic method based on blink embedding and hash grid landmarks encoding, which can substantially enhance the fidelity of talking faces. Specifically, we leverage facial features encoded as conditional features and integrate audio features as residual terms into our model through a Dynamic Landmark Transformer. Furthermore, we employ neural radiance fields to model the entire face, resulting in a lifelike face representation. Experimental evaluations have demonstrated the superiority of our approach over existing state-of-the-art methods.
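The abstract's pipeline (facial landmarks encoded as conditional features via a multi-resolution hash grid, with audio features added as a residual term) can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: the `hash_grid_encode` function, its parameters, and the toy residual fusion are all assumptions, with randomly initialised tables standing in for learned ones and a random vector standing in for the Dynamic Landmark Transformer's audio output.

```python
import numpy as np

def hash_grid_encode(points, num_levels=4, base_res=4,
                     table_size=1024, feat_dim=2, seed=0):
    """Multi-resolution hash grid encoding (Instant-NGP style) for 2D landmarks.

    points: (N, 2) array with coordinates normalised to [0, 1].
    Returns an (N, num_levels * feat_dim) feature matrix.
    """
    rng = np.random.default_rng(seed)
    # One feature table per resolution level; in a trained model these
    # entries would be learnable parameters, here they are random.
    tables = [rng.normal(0.0, 0.1, size=(table_size, feat_dim))
              for _ in range(num_levels)]
    primes = np.array([1, 2654435761], dtype=np.uint64)  # spatial-hash primes
    feats = []
    for level, table in enumerate(tables):
        res = base_res * (2 ** level)                    # finer grid each level
        cell = np.floor(points * res).astype(np.uint64)  # grid cell per point
        # XOR-based spatial hash maps each grid cell to a table slot.
        h = (cell[:, 0] * primes[0]) ^ (cell[:, 1] * primes[1])
        idx = (h % table_size).astype(int)
        feats.append(table[idx])
    return np.concatenate(feats, axis=1)

# Landmarks become conditional features; the audio feature is then
# injected as a residual term (a toy stand-in for the output of the
# paper's Dynamic Landmark Transformer).
landmarks = np.random.default_rng(2).random((68, 2))  # 68 facial landmarks
cond = hash_grid_encode(landmarks)                    # (68, 8) conditional features
audio_residual = 0.1 * np.random.default_rng(3).random(cond.shape)
fused = cond + audio_residual                         # residual injection
```

The hash lookup keeps memory constant per level regardless of grid resolution, which is what makes landmark conditioning at multiple scales cheap enough for a dynamic NeRF.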
Problem

Research questions and friction points this paper is trying to address.

Talking Face Generation
Mouth Movement
Audio-Driven
Neural Radiance Fields
Facial Animation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Blink Embedding
Hash Grid Landmarks Encoding
Dynamic Neural Radiance Fields
Audio-Driven Talking Face
Dynamic Landmark Transformer
Yuhui Zhang
Stanford University
Machine Learning, Computer Vision, Natural Language Processing, Biotech
Hui Yu
Professor of Visual and Cognitive Computing, University of Glasgow
Visual Computing, Cognitive Computing, Social Robot, Parallel Intelligence
Wei Liang
Department of Control Science and Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China; School of Psychology and Neuroscience, University of Glasgow, 62 Hillhead Street, Glasgow G12 8QB, Scotland, UK
Sunjie Zhang
Department of Control Science and Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China