🤖 AI Summary
This work addresses a limitation of existing audio-driven talking face generation methods, which often lack accuracy and efficiency in modeling fine-grained mouth motion. To overcome this, the authors propose a novel landmark representation that integrates a blink embedding with hash grid encoding, coupled with a Dynamic Landmark Transformer. This architecture injects audio features as residual terms into a dynamic neural radiance field (Dynamic NeRF), enabling high-fidelity facial animation with strong audio-visual synchronization. The approach enhances the naturalness and expressiveness of both lip movements and overall facial expressions, and experimental results show that it outperforms current state-of-the-art methods in generation quality and detail fidelity.
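To make the conditioning pathway concrete, here is a minimal PyTorch sketch of a hashed landmark encoding combined with a learned blink embedding. All names and hyperparameters (`LandmarkHashEncoder`, `num_levels`, `table_size`, the prime multipliers, nearest-vertex lookup instead of interpolation) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: hash-grid encoding of 2D facial landmarks plus a blink
# embedding, concatenated into a single conditioning vector.
import torch
import torch.nn as nn


class LandmarkHashEncoder(nn.Module):
    """Multi-resolution hashed lookup of normalized 2D landmark coordinates."""

    def __init__(self, num_levels=4, base_res=16, table_size=2**14, feat_dim=4):
        super().__init__()
        self.resolutions = [base_res * (2 ** level) for level in range(num_levels)]
        self.tables = nn.ModuleList(
            nn.Embedding(table_size, feat_dim) for _ in range(num_levels)
        )
        self.table_size = table_size
        # Prime multipliers for a simple spatial hash of grid-vertex indices.
        self.register_buffer("primes", torch.tensor([1, 2654435761], dtype=torch.long))

    def forward(self, landmarks):
        # landmarks: (B, K, 2), coordinates normalized to [0, 1].
        feats = []
        for res, table in zip(self.resolutions, self.tables):
            # Nearest grid vertex at this resolution (no interpolation, for brevity).
            idx = (landmarks * res).long().clamp_(0, res - 1)        # (B, K, 2)
            hashed = (idx * self.primes).sum(-1) % self.table_size   # (B, K)
            feats.append(table(hashed))                              # (B, K, F)
        return torch.cat(feats, dim=-1)                              # (B, K, L*F)


class KeypointCondition(nn.Module):
    """Concatenates hashed landmark features with a learned blink embedding."""

    def __init__(self, num_blink_bins=8, blink_dim=16):
        super().__init__()
        self.landmark_enc = LandmarkHashEncoder()
        self.blink_embed = nn.Embedding(num_blink_bins, blink_dim)
        self.num_blink_bins = num_blink_bins

    def forward(self, landmarks, blink_ratio):
        # blink_ratio: (B,) eye openness in [0, 1], quantized into discrete bins.
        lm_feat = self.landmark_enc(landmarks).flatten(1)
        bins = (blink_ratio * (self.num_blink_bins - 1)).round().long()
        return torch.cat([lm_feat, self.blink_embed(bins)], dim=-1)


# Usage with dummy inputs: 68 landmarks per frame, batch of 2.
cond = KeypointCondition()(torch.rand(2, 68, 2), torch.rand(2))
print(cond.shape)  # torch.Size([2, 1104]) = 68 landmarks * 4 levels * 4 feats + 16
```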
📝 Abstract
Dynamic Neural Radiance Fields (NeRF) have demonstrated impressive success in generating high-fidelity 3D models of talking portraits. Despite progress in rendering speed and generation quality, accurately and efficiently capturing mouth movements in talking portraits remains challenging. To tackle this challenge, we propose an automatic method based on blink embedding and hash grid landmark encoding, which substantially enhances the fidelity of talking faces. Specifically, we encode facial landmarks as conditional features and integrate audio features as residual terms into our model through a Dynamic Landmark Transformer. Furthermore, we employ neural radiance fields to model the entire face, resulting in a lifelike facial representation. Experimental evaluations demonstrate the superiority of our approach over existing state-of-the-art methods.
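As one possible reading of how a Dynamic Landmark Transformer could inject audio features as residual terms into the NeRF conditioning, the sketch below uses cross-attention from landmark tokens to audio tokens, adds the result back as a residual, and feeds the pooled features to a toy NeRF head. Module names, dimensions, and the pooling step are assumptions, not the authors' released design.

```python
# Hypothetical sketch: audio features attended to by landmark tokens and
# injected as a residual term, then used to condition a NeRF-style MLP.
import torch
import torch.nn as nn


class DynamicLandmarkTransformer(nn.Module):
    def __init__(self, dim=64, num_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, landmark_tokens, audio_tokens):
        # landmark_tokens: (B, K, D) encoded landmarks; audio_tokens: (B, T, D).
        # Audio enters only through the residual term, so the landmark
        # representation remains dominant when audio is uninformative.
        attn_out, _ = self.cross_attn(
            query=landmark_tokens, key=audio_tokens, value=audio_tokens
        )
        return self.norm(landmark_tokens + attn_out)  # residual injection


class ConditionedNeRF(nn.Module):
    """Toy NeRF head: predicts density and RGB from a sample point plus condition."""

    def __init__(self, dim=64, pos_dim=63):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(pos_dim + dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 4),  # (sigma, r, g, b)
        )

    def forward(self, encoded_xyz, condition):
        # encoded_xyz: (B, N, pos_dim) positionally encoded sample points.
        # condition: (B, D) pooled landmark/audio feature, broadcast per point.
        cond = condition.unsqueeze(1).expand(-1, encoded_xyz.shape[1], -1)
        return self.mlp(torch.cat([encoded_xyz, cond], dim=-1))


# Usage sketch with random tensors standing in for real features.
B, K, T, D, N = 2, 68, 16, 64, 1024
fuse = DynamicLandmarkTransformer(dim=D)
nerf = ConditionedNeRF(dim=D)
tokens = fuse(torch.randn(B, K, D), torch.randn(B, T, D))
out = nerf(torch.randn(B, N, 63), tokens.mean(dim=1))  # pool landmarks, query NeRF
print(out.shape)  # torch.Size([2, 1024, 4])
```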