See the Speaker: Crafting High-Resolution Talking Faces from Speech with Prior Guidance and Region Refinement

📅 2025-10-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of generating high-resolution (1024×1024) talking-face videos directly from audio input, without requiring source reference images. To this end, the authors propose an end-to-end, two-stage speech-conditioned diffusion framework: (i) expressive facial dynamics such as lip movement, facial expressions, and eye movements are embedded in the diffusion model's latent space; (ii) a region-enhancement module further refines lip synchronization; (iii) a hybrid prior integration, combining a statistical facial prior with sample-adaptive weighting, improves generation consistency; and (iv) a pre-trained Transformer-based discrete codebook, paired with an image rendering network, sharpens frame details for high-resolution output. Extensive experiments on HDTF, VoxCeleb, and AVSpeech show that the method achieves state-of-the-art visual naturalness and lip-sync accuracy (LSE), significantly outperforming existing approaches. To the authors' knowledge, this is the first method to enable high-fidelity, high-resolution talking-face synthesis from speech alone.
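The hybrid prior integration can be read as a learned blend between a population-level facial prior and a per-sample prediction. Below is a minimal PyTorch sketch of such sample-adaptive weighting; the module name, the MLP weighting head, and the mean-face prior are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn

class SampleAdaptivePriorBlend(nn.Module):
    """Blend a fixed statistical facial prior with a per-sample prediction.

    A small MLP maps the speech embedding to a weight in (0, 1) deciding how
    strongly the statistical prior constrains this particular sample.
    (Hypothetical module, not the authors' implementation.)
    """
    def __init__(self, speech_dim: int, face_dim: int):
        super().__init__()
        # Statistical prior, e.g. the mean face latent over the training set.
        self.register_buffer("face_prior", torch.zeros(face_dim))
        self.weight_mlp = nn.Sequential(
            nn.Linear(speech_dim, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid(),
        )

    def forward(self, speech_emb: torch.Tensor, face_pred: torch.Tensor) -> torch.Tensor:
        # w near 1 -> trust the statistical prior; w near 0 -> trust the
        # sample-specific prediction from the diffusion backbone.
        w = self.weight_mlp(speech_emb)  # (B, 1)
        return w * self.face_prior + (1.0 - w) * face_pred
```

The sigmoid keeps the blend weight in (0, 1), so each sample can lean on the statistical prior as much or as little as its speech embedding warrants.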

📝 Abstract
Unlike existing methods that rely on source images as appearance references and use source speech to generate motion, this work proposes a novel approach that directly extracts information from speech, addressing key challenges in speech-to-talking-face generation. Specifically, we first employ a speech-to-face portrait generation stage, utilizing a speech-conditioned diffusion model combined with a statistical facial prior and a sample-adaptive weighting module to achieve high-quality portrait generation. In the subsequent speech-driven talking face generation stage, we embed expressive dynamics such as lip movement, facial expressions, and eye movements into the latent space of the diffusion model and further optimize lip synchronization using a region-enhancement module. To generate high-resolution outputs, we integrate a pre-trained Transformer-based discrete codebook with an image rendering network, enhancing video frame details in an end-to-end manner. Experimental results demonstrate that our method outperforms existing approaches on the HDTF, VoxCeleb, and AVSpeech datasets. Notably, this is the first method capable of generating high-resolution, high-quality talking face videos exclusively from a single speech input.
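One plausible reading of the region-enhancement module is spatially weighted supervision concentrated on the mouth. The PyTorch sketch below upweights reconstruction error inside a mouth-region mask; the mask source (e.g. facial landmarks), the L1 base loss, and the weight value are assumptions for illustration, not details from the paper.

```python
import torch
import torch.nn.functional as F

def region_enhanced_loss(pred: torch.Tensor,
                         target: torch.Tensor,
                         mouth_mask: torch.Tensor,
                         lip_weight: float = 5.0) -> torch.Tensor:
    """L1 reconstruction loss with extra weight on the mouth region.

    pred, target: (B, C, H, W) frames; mouth_mask: (B, 1, H, W) in {0, 1}.
    (Hypothetical formulation, not taken from the paper.)
    """
    per_pixel = F.l1_loss(pred, target, reduction="none")
    # Pixels inside the mouth mask count lip_weight times as much, pushing
    # the model to prioritise lip articulation over background detail.
    weights = 1.0 + (lip_weight - 1.0) * mouth_mask
    return (weights * per_pixel).mean()
```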
Problem

Research questions and friction points this paper is trying to address.

Generating high-resolution talking faces solely from speech input
Enhancing lip synchronization and facial dynamics without source images
Overcoming limitations of appearance-reference-dependent synthesis methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speech-conditioned diffusion model with a statistical facial prior and sample-adaptive weighting
Latent space embedding of expressive facial dynamics
Transformer-based discrete codebook for high-resolution rendering (see the sketch after this list)
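To make the discrete-codebook idea concrete, here is a minimal vector-quantization lookup in the style of VQGAN-like codebooks, which the pre-trained Transformer-based codebook plausibly resembles; the codebook size, feature shapes, and straight-through trick are generic assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn

class DiscreteCodebook(nn.Module):
    """Snap continuous features to their nearest codebook entries.

    (Generic VQ-style sketch; sizes and shapes are placeholders.)
    """
    def __init__(self, num_codes: int = 1024, code_dim: int = 256):
        super().__init__()
        self.codes = nn.Embedding(num_codes, code_dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, N, D) continuous features from the encoder / Transformer.
        # Euclidean distance from every token to every code: (B, N, num_codes).
        d = torch.cdist(z, self.codes.weight.unsqueeze(0))
        idx = d.argmin(dim=-1)   # index of the nearest code per token
        z_q = self.codes(idx)    # quantized features, same shape as z
        # Straight-through estimator: forward pass uses z_q, gradients flow
        # back to the encoder as if quantization were the identity.
        return z + (z_q - z).detach()
```

The straight-through trick copies gradients past the non-differentiable argmin, which is the standard way such codebooks stay trainable end to end.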