Narrating For You: Prompt-guided Audio-visual Narrating Face Generation Employing Multi-entangled Latent Space

📅 2026-02-20
🤖 AI Summary
This work addresses the challenge of generating personalized talking-face videos from a static face image, a speech signal, and a driving text, with precise lip–speech synchronization. The authors propose a multimodal fusion framework comprising a multimodal encoder, a multi-entangled latent space modeling mechanism, and a disentangled audiovisual decoder. By jointly modeling identity-specific attributes and cross-modal spatiotemporal consistency within the latent space, the method aims to produce high-quality, temporally coherent audiovisual output. Experimental results are reported to show that the generated videos outperform existing approaches in naturalness, personalization fidelity, and lip-sync accuracy.

📝 Abstract
We present a novel approach for generating realistic talking faces by synthesizing a person's voice and facial movements from a static image, a voice profile, and a target text. The model encodes the prompt (driving text), the driving image, and the individual's voice profile, then combines the encodings and passes them to a multi-entangled latent space, which produces the key-value pairs and queries for the audio and video generation pipelines. The multi-entangled latent space establishes the spatiotemporal, person-specific features shared between the modalities. The entangled features are then passed to each modality's decoder to generate the output audio and video.
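
The abstract says the latent space supplies key-value pairs and queries to the audio and video pipelines, which suggests an attention-style mechanism. The sketch below illustrates that reading only; it is not the authors' implementation. It assumes plain scaled dot-product cross-attention, random stand-in embeddings in place of real encoders, and hypothetical names and sizes (`DIM`, `fused`, `w_k`, `w_v`, `w_q_audio`, `w_q_video`, the token and frame counts).

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query attends over all key/value rows."""
    dim = len(keys[0])
    keys_t = [list(col) for col in zip(*keys)]
    scores = matmul(queries, keys_t)
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    return matmul(weights, values), weights


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
DIM = 8  # assumed embedding width, chosen only for illustration

# Stand-ins for the three encoder outputs (assumed token counts).
text_tokens = random_matrix(5, DIM, rng)   # encoded prompt / driving text
image_tokens = random_matrix(4, DIM, rng)  # encoded driving image
voice_tokens = random_matrix(3, DIM, rng)  # encoded voice profile

# "Combine": here simply concatenation along the token axis (an assumption).
fused = text_tokens + image_tokens + voice_tokens

# Hypothetical projections: shared keys/values, one query projection per modality.
w_k = random_matrix(DIM, DIM, rng)
w_v = random_matrix(DIM, DIM, rng)
w_q_audio = random_matrix(DIM, DIM, rng)
w_q_video = random_matrix(DIM, DIM, rng)

keys = matmul(fused, w_k)
values = matmul(fused, w_v)

# One query per output step: 6 audio frames and 10 video frames (assumed).
audio_queries = matmul(random_matrix(6, DIM, rng), w_q_audio)
video_queries = matmul(random_matrix(10, DIM, rng), w_q_video)

# Both modalities read from the same fused key-value memory, so their
# features are conditioned on the same text, identity, and voice information.
audio_features, audio_weights = cross_attention(audio_queries, keys, values)
video_features, video_weights = cross_attention(video_queries, keys, values)

print(len(audio_features), len(video_features), len(fused))
```

In this sketch, `audio_features` and `video_features` would be the inputs to the per-modality decoders. Sharing one key-value memory across both query sets is one simple way to tie the two output streams to the same conditioning signal.
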
Problem

Research questions and friction points this paper is trying to address.

audio-visual generation
talking face synthesis
latent space
multimodal synthesis
prompt-guided generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-entangled latent space
prompt-guided generation
audio-visual face synthesis
cross-modal alignment
talking face generation
Aashish Chandra
Machine Intelligence Group, Department of CS&IS, BITS Pilani, Hyderabad Campus, India
Aashutosh A V
Machine Intelligence Group, Department of CS&IS, BITS Pilani, Hyderabad Campus, India; Georgia Institute of Technology, USA
Abhijit Das
BITS Pilani Hyderabad, Dept of CS&IS
Computer Vision, Pattern Recognition, Machine Learning