Dual Audio-Centric Modality Coupling for Talking Head Generation

📅 2025-03-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address inaccurate lip synchronization and low visual quality in audio-driven talking-head generation, this paper proposes a NeRF-based dual-encoding framework. Methodologically, the authors design a dual-stream audio encoder, pairing a Content-Aware Encoder with a Dynamic-Sync Encoder to disentangle audio semantics from facial temporal dynamics at a fine granularity. They further introduce a Cross-Synchronized Fusion Module (CSFM) that explicitly models the alignment between audio signals and facial motion, while supporting both natural speech and text-to-speech (TTS) inputs. Extensive experiments demonstrate state-of-the-art performance across the major metrics: the method significantly reduces lip synchronization error (LSE) and improves visual fidelity as measured by LPIPS and FID. Notably, it exhibits strong robustness and high-fidelity rendering under both real speech and TTS-driven audio inputs.

📝 Abstract
The generation of audio-driven talking head videos is a key challenge in computer vision and graphics, with applications in virtual avatars and digital media. Traditional approaches often struggle with capturing the complex interaction between audio and facial dynamics, leading to lip synchronization and visual quality issues. In this paper, we propose a novel NeRF-based framework, Dual Audio-Centric Modality Coupling (DAMC), which effectively integrates content and dynamic features from audio inputs. By leveraging a dual encoder structure, DAMC captures semantic content through the Content-Aware Encoder and ensures precise visual synchronization through the Dynamic-Sync Encoder. These features are fused using a Cross-Synchronized Fusion Module (CSFM), enhancing content representation and lip synchronization. Extensive experiments show that our method outperforms existing state-of-the-art approaches in key metrics such as lip synchronization accuracy and image quality, demonstrating robust generalization across various audio inputs, including synthetic speech from text-to-speech (TTS) systems. Our results provide a promising solution for high-quality, audio-driven talking head generation and present a scalable approach for creating realistic talking heads.
Problem

Research questions and friction points this paper is trying to address.

Improving audio-driven talking head video generation
Enhancing lip synchronization and visual quality
Integrating audio content and dynamic features effectively
Innovation

Methods, ideas, or system contributions that make the work stand out.

NeRF-based framework for talking head generation
Dual encoder captures content and dynamics
Cross-Synchronized Fusion Module enhances synchronization
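The paper does not include code, but the fusion idea behind the bullets above can be sketched as cross-attention between the two audio streams: content features query the dynamic features so that lip-sync cues modulate the semantic representation. This is a minimal NumPy sketch under stated assumptions; the variable names, the use of single-head dot-product attention, and the residual combination are illustrative guesses, not the authors' CSFM implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key, value):
    # Scaled dot-product attention: each query row attends over all key rows.
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ value

# Toy dual-stream features: T audio frames, d-dim embeddings (shapes are arbitrary).
T, d = 50, 64
rng = np.random.default_rng(0)
content_feat = rng.standard_normal((T, d))  # stands in for Content-Aware Encoder output
dynamic_feat = rng.standard_normal((T, d))  # stands in for Dynamic-Sync Encoder output

# CSFM-style fusion sketch: content queries attend to dynamic keys/values,
# and the attended result is residually combined with the content stream.
fused = content_feat + cross_attention(content_feat, dynamic_feat, dynamic_feat)
print(fused.shape)  # (50, 64)
```

The fused features would then condition the NeRF renderer; in the real system the encoders are learned networks rather than random projections.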
Ao Fu
School of Computer Science and Engineering, Southeast University, China; Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Ministry of Education, China
Ziqi Ni
Southeast University
Computer Vision · Generative AI
Yi Zhou
School of Computer Science and Engineering, Southeast University, China; Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Ministry of Education, China