MAGIC-Talk: Motion-aware Audio-Driven Talking Face Generation with Customizable Identity Control

📅 2025-10-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Audio-driven talking face generation faces challenges in long-video synthesis, including identity drift, temporal inconsistency, and limited editing flexibility. To address these, we propose an identity-preserving framework that leverages a single reference image, integrating structured motion priors with a progressive latent fusion strategy to significantly enhance motion coherence and stability over extended sequences. Our method employs a dual-diffusion architecture (ReferenceNet and AnimateNet) that incorporates text-guided identity control and motion-aware conditioning to enable fine-grained, zero-shot facial editing without model fine-tuning. Extensive evaluations on multiple benchmarks demonstrate state-of-the-art performance in visual quality, identity fidelity, and lip-sync accuracy. The framework supports high-fidelity, customizable, and temporally coherent talking face video generation over extended durations.

📝 Abstract
Audio-driven talking face generation has gained significant attention for applications in digital media and virtual avatars. While recent methods improve audio-lip synchronization, they often struggle with temporal consistency, identity preservation, and customization, especially in long video generation. To address these issues, we propose MAGIC-Talk, a one-shot diffusion-based framework for customizable and temporally stable talking face generation. MAGIC-Talk consists of ReferenceNet, which preserves identity and enables fine-grained facial editing via text prompts, and AnimateNet, which enhances motion coherence using structured motion priors. Unlike previous methods requiring multiple reference images or fine-tuning, MAGIC-Talk maintains identity from a single image while ensuring smooth transitions across frames. Additionally, a progressive latent fusion strategy is introduced to improve long-form video quality by reducing motion inconsistencies and flickering. Extensive experiments demonstrate that MAGIC-Talk outperforms state-of-the-art methods in visual quality, identity preservation, and synchronization accuracy, offering a robust solution for talking face generation.
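The abstract's "progressive latent fusion" is described only at a high level: overlapping latent chunks of the long video are blended to suppress seams and flickering at chunk boundaries. The paper gives no equations here, so the snippet below is a minimal sketch of one common way such fusion is done (a linear cross-fade over the overlapping frames); the function name, shapes, and blending weights are assumptions for illustration, not the authors' exact formulation.

```python
import numpy as np

def progressive_latent_fusion(chunks, overlap):
    """Hypothetical sketch: fuse consecutive latent chunks of shape
    (T, C, H, W) by linearly cross-fading the `overlap` frames where
    adjacent chunks meet, reducing boundary flicker in long videos."""
    # Linear ramp 0 -> 1 across the overlap, broadcast over (C, H, W).
    ramp = np.linspace(0.0, 1.0, overlap).reshape(-1, 1, 1, 1)
    fused = chunks[0]
    for nxt in chunks[1:]:
        tail = fused[-overlap:]   # last frames of what we have so far
        head = nxt[:overlap]      # first frames of the incoming chunk
        blended = (1.0 - ramp) * tail + ramp * head
        fused = np.concatenate(
            [fused[:-overlap], blended, nxt[overlap:]], axis=0)
    return fused
```

With two 8-frame chunks and a 4-frame overlap, the result is a 12-frame sequence whose middle frames interpolate smoothly between the two chunks instead of switching abruptly.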
Problem

Research questions and friction points this paper is trying to address.

Achieving temporal consistency in audio-driven talking face generation
Preserving identity from single reference images without fine-tuning
Reducing motion flickering in long-form video synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

One-shot diffusion framework for customizable talking faces
ReferenceNet preserves identity and enables facial editing
AnimateNet enhances motion coherence with structured priors