PortraitTalk: Towards Customizable One-Shot Audio-to-Talking Face Generation

📅 2024-12-10
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing audio-driven talking face methods primarily focus on lip-sync accuracy while neglecting visual quality, identity fidelity, and generalization capability. To address these limitations, we propose a latent diffusion-based two-module framework—IdentityNet and AnimateNet—that generates high-fidelity, identity-consistent, and naturally animated talking face videos from only a single reference image and an input audio clip. The framework further supports text-prompted facial expression and stylistic editing. Our key innovation is a decoupled cross-modal attention mechanism that jointly integrates audio encoding, frozen identity features, text-guided cross-attention, and explicit motion modeling. Evaluated on a newly constructed multi-dimensional benchmark, our method significantly outperforms state-of-the-art approaches, demonstrating zero-shot generalization and fine-grained creative control. It is readily applicable to practical scenarios including virtual avatars, educational tools, and accessible human-computer interaction.

📝 Abstract
Audio-driven talking face generation is a challenging task in digital communication. Despite significant progress in the area, most existing methods concentrate on audio-lip synchronization, often overlooking aspects such as visual quality, customization, and generalization that are crucial to producing realistic talking faces. To address these limitations, we introduce a novel, customizable one-shot audio-driven talking face generation framework, named PortraitTalk. Our proposed method utilizes a latent diffusion framework consisting of two main components: IdentityNet and AnimateNet. IdentityNet is designed to preserve identity features consistently across the generated video frames, while AnimateNet aims to enhance temporal coherence and motion consistency. This framework also integrates an audio input with the reference images, thereby reducing the reliance on reference-style videos prevalent in existing approaches. A key innovation of PortraitTalk is the incorporation of text prompts through decoupled cross-attention mechanisms, which significantly expands creative control over the generated videos. Through extensive experiments, including a newly developed evaluation metric, our model demonstrates superior performance over the state-of-the-art methods, setting a new standard for the generation of customizable realistic talking faces suitable for real-world applications.
Problem

Research questions and friction points this paper is trying to address.

Generating customizable one-shot audio-driven talking faces
Overcoming limitations in visual quality and generalization
Reducing reliance on reference-style videos for generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent diffusion framework for identity preservation
Decoupled cross-attention enables text-guided control
One-shot generation without reference-style videos
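The decoupled cross-attention idea above can be illustrated with a minimal sketch: instead of concatenating all condition tokens into one attention pass, each condition stream (e.g. text prompt vs. identity/audio features) gets its own cross-attention over the latent queries, and the results are summed. The function names, dimensions, and the 0.5 mixing weight below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def attention(q, k, v):
    # Standard scaled dot-product attention over one condition stream.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def decoupled_cross_attention(q, text_kv, ident_kv, ident_scale=0.5):
    # Decoupled variant: separate attention per condition stream,
    # summed afterwards, so text guidance and identity features do
    # not compete inside a single softmax distribution.
    k_t, v_t = text_kv
    k_i, v_i = ident_kv
    return attention(q, k_t, v_t) + ident_scale * attention(q, k_i, v_i)

rng = np.random.default_rng(0)
d = 8
q = rng.standard_normal((4, d))                                   # latent queries
text_kv = (rng.standard_normal((5, d)), rng.standard_normal((5, d)))   # 5 text tokens
ident_kv = (rng.standard_normal((3, d)), rng.standard_normal((3, d)))  # 3 identity tokens
out = decoupled_cross_attention(q, text_kv, ident_kv)
print(out.shape)  # (4, 8)
```

Because the streams are mixed additively after attention, the identity contribution can be rescaled (or dropped) at inference time without retraining the text pathway, which is one way such a design could expose the "fine-grained creative control" the summary describes.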
Fatemeh Nazarieh
University of Surrey
Generative AI · Computer Vision · Machine Learning · Pattern Recognition
Zhenhua Feng
School of Artificial Intelligence and Computer Science, Jiangnan University, China
Diptesh Kanojia
Senior Lecturer at University of Surrey | Institute for People-Centred AI
Natural Language Processing · Artificial Intelligence
Muhammad Awais
Centre for Vision, Speech and Signal Processing, University of Surrey, UK; Institute for People-Centred AI, University of Surrey, UK
Josef Kittler
University of Surrey
Engineering