AI Summary
This work addresses the challenge of generating personalized talking-face videos from a static face image, speech signal, and driving text, with precise lip-speech synchronization. To this end, the authors propose a multimodal fusion framework comprising a multimodal encoder, an innovative multi-entangled latent space modeling mechanism, and a disentangled audiovisual decoder. By jointly modeling identity-specific attributes and cross-modal spatiotemporal consistency within the latent space, the method achieves high-quality, temporally coherent audiovisual synthesis. Experimental results demonstrate that the generated videos outperform existing approaches in terms of naturalness, personalization fidelity, and lip-sync accuracy.
Abstract
We present a novel approach for generating realistic talking faces by synthesizing a person's voice and facial movements from a static image, a voice profile, and a target text. The model encodes the prompt/driving text, the driving image, and the voice profile of an individual, combines these representations, and passes them to the multi-entangled latent space, which forms key-value pairs and queries for the audio and video generation pipelines. The multi-entangled latent space is responsible for establishing spatiotemporal, person-specific features across the modalities. The entangled features are then passed to the respective decoder of each modality to generate the output audio and video.
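The pipeline described above (modality encoders, a shared entangled latent space that supplies key-value pairs and queries, and per-modality decoders) can be illustrated with a minimal sketch. The PyTorch code below is a hypothetical illustration only: the encoder backbones, feature dimensions, and module names (e.g. `MultiEntangledLatentSpace`, `TalkingFaceSketch`) are assumptions for exposition and are not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiEntangledLatentSpace(nn.Module):
    """Hypothetical sketch: fused identity features act as keys/values;
    each modality stream attends to them with its own queries."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, queries: torch.Tensor, fused: torch.Tensor) -> torch.Tensor:
        # queries: (B, T, dim) modality-specific query sequence
        # fused:   (B, S, dim) entangled text + image + voice features (keys/values)
        out, _ = self.cross_attn(queries, fused, fused)
        return out

class TalkingFaceSketch(nn.Module):
    """Toy end-to-end wiring; all dimensions are placeholder assumptions."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.text_enc = nn.Linear(300, dim)    # driving-text embeddings (assumed size)
        self.image_enc = nn.Linear(512, dim)   # static face-image features (assumed size)
        self.voice_enc = nn.Linear(192, dim)   # voice-profile / speaker embedding (assumed size)
        self.entangle = MultiEntangledLatentSpace(dim)
        self.audio_dec = nn.Linear(dim, 80)          # e.g. mel-spectrogram frames
        self.video_dec = nn.Linear(dim, 3 * 64 * 64) # e.g. low-res face frames

    def forward(self, text, image, voice, audio_q, video_q):
        # Encode and concatenate the three conditioning streams along the sequence axis.
        fused = torch.cat(
            [self.text_enc(text), self.image_enc(image), self.voice_enc(voice)], dim=1
        )
        # Each modality queries the shared entangled latent space, then decodes its output.
        audio_lat = self.entangle(audio_q, fused)
        video_lat = self.entangle(video_q, fused)
        return self.audio_dec(audio_lat), self.video_dec(video_lat)
```

In this sketch the fused conditioning features serve as keys and values while each generation stream provides queries, which is one straightforward reading of the key-value/query framing in the abstract; the actual entanglement mechanism in the paper may differ.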