AI Summary
This work addresses the challenge of generating personalized talking-face videos from a static face image, speech signal, and driving text, with precise lip-speech synchronization. To this end, the authors propose a multimodal fusion framework comprising a multimodal encoder, an innovative multi-entangled latent space modeling mechanism, and a disentangled audiovisual decoder. By jointly modeling identity-specific attributes and cross-modal spatiotemporal consistency within the latent space, the method achieves high-quality, temporally coherent audiovisual synthesis. Experimental results demonstrate that the generated videos outperform existing approaches in terms of naturalness, personalization fidelity, and lip-sync accuracy.
Abstract
We present a novel approach for generating realistic talking faces by synthesizing a person's voice and facial movements from a static image, a voice profile, and a target text. The model encodes the prompt/driving text, the driving image, and the voice profile of an individual, combines these representations, and passes them to the multi-entangled latent space, which forms key-value pairs and queries for the audio and video generation pipelines. The multi-entangled latent space is responsible for establishing spatiotemporal, person-specific features across the modalities. The entangled features are then passed to the respective decoder of each modality to generate the output audio and video.
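The pipeline described above (modality encoders, a shared entangled latent space that supplies key-value pairs and queries, and per-modality decoders) can be illustrated with a minimal sketch. The PyTorch code below is a hypothetical illustration only: the encoder backbones, feature dimensions, and module names (e.g. `MultiEntangledLatentSpace`, `TalkingFaceSketch`) are assumptions for exposition and are not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiEntangledLatentSpace(nn.Module):
    """Hypothetical sketch: fused identity features act as keys/values;
    each modality stream attends to them with its own queries."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, queries: torch.Tensor, fused: torch.Tensor) -> torch.Tensor:
        # queries: (B, T, dim) modality-specific query sequence
        # fused:   (B, S, dim) entangled text + image + voice features (keys/values)
        out, _ = self.cross_attn(queries, fused, fused)
        return out

class TalkingFaceSketch(nn.Module):
    """Toy end-to-end wiring; all dimensions are placeholder assumptions."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.text_enc = nn.Linear(300, dim)    # driving-text embeddings (assumed size)
        self.image_enc = nn.Linear(512, dim)   # static face-image features (assumed size)
        self.voice_enc = nn.Linear(192, dim)   # voice-profile / speaker embedding (assumed size)
        self.entangle = MultiEntangledLatentSpace(dim)
        self.audio_dec = nn.Linear(dim, 80)          # e.g. mel-spectrogram frames
        self.video_dec = nn.Linear(dim, 3 * 64 * 64) # e.g. low-res face frames

    def forward(self, text, image, voice, audio_q, video_q):
        # Encode and concatenate the three conditioning streams along the sequence axis.
        fused = torch.cat(
            [self.text_enc(text), self.image_enc(image), self.voice_enc(voice)], dim=1
        )
        # Each modality queries the shared entangled latent space, then decodes its output.
        audio_lat = self.entangle(audio_q, fused)
        video_lat = self.entangle(video_q, fused)
        return self.audio_dec(audio_lat), self.video_dec(video_lat)
```

In this sketch the fused conditioning features serve as keys and values while each generation stream provides queries, which is one straightforward reading of the key-value/query framing in the abstract; the actual entanglement mechanism in the paper may differ.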