🤖 AI Summary
This work addresses three key challenges in audio-driven human animation: preserving character identity in highly dynamic video, precisely aligning character emotion with the driving audio, and animating multiple characters from independent audio streams. We propose HunyuanVideo-Avatar, the first generative framework enabling high-fidelity, emotion-controllable, multi-character conversational animation. Methodologically: (1) we introduce a character image injection module that replaces addition-based character conditioning, ensuring cross-frame identity consistency without sacrificing dynamic motion; (2) we design an Audio Emotion Module (AEM) that transfers emotional cues from an emotion reference image to the generated video, enabling fine-grained emotion control; (3) we propose a Face-Aware Audio Adapter (FAA) that isolates each audio-driven character with a latent-level face mask, enabling independent audio injection via cross-attention within the multimodal diffusion transformer (MM-DiT). Our method achieves significant improvements over state-of-the-art methods on multiple benchmarks and a newly constructed in-the-wild dataset, marking the first demonstration of dynamic, immersive, high-quality video generation with independent audio-driven animation of multiple characters.
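To make the conditioning distinction concrete, the PyTorch sketch below contrasts addition-based character conditioning with token-level character image injection: the reference-image latents are appended to the token sequence rather than added to the noisy video latent. The module name, projection layers, and shapes are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class CharacterImageInjection(nn.Module):
    """Illustrative sketch: inject reference-character tokens into the DiT
    token stream instead of adding them to the noisy video latent.
    Projection sizes and the concatenation scheme are assumptions."""

    def __init__(self, latent_dim: int, model_dim: int):
        super().__init__()
        self.video_proj = nn.Linear(latent_dim, model_dim)  # patchified noisy video latents
        self.ref_proj = nn.Linear(latent_dim, model_dim)    # patchified reference-image latents

    def forward(self, video_tokens: torch.Tensor, ref_tokens: torch.Tensor) -> torch.Tensor:
        # Addition-based conditioning would do something like:
        #   x = self.video_proj(video_tokens + ref_tokens)
        # mixing a clean reference into a noisy latent and creating a
        # train/inference condition mismatch.
        #
        # Token injection instead appends the reference tokens to the sequence,
        # so full self-attention can read identity cues at every denoising step.
        x = torch.cat([self.ref_proj(ref_tokens), self.video_proj(video_tokens)], dim=1)
        return x  # (B, N_ref + N_video, model_dim), fed to the MM-DiT blocks

# Toy usage: 16 reference tokens + 256 video tokens, latent dim 64.
inject = CharacterImageInjection(latent_dim=64, model_dim=128)
out = inject(torch.randn(2, 256, 64), torch.randn(2, 16, 64))
print(out.shape)  # torch.Size([2, 272, 128])
```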
📝 Abstract
Recent years have witnessed significant progress in audio-driven human animation. However, critical challenges remain in (i) generating highly dynamic videos while preserving character consistency, (ii) achieving precise emotion alignment between characters and audio, and (iii) enabling multi-character audio-driven animation. To address these challenges, we propose HunyuanVideo-Avatar, a multimodal diffusion transformer (MM-DiT)-based model capable of simultaneously generating dynamic, emotion-controllable, and multi-character dialogue videos. Concretely, HunyuanVideo-Avatar introduces three key innovations: (i) a character image injection module designed to replace the conventional addition-based character conditioning scheme, eliminating the inherent condition mismatch between training and inference and thereby ensuring dynamic motion with strong character consistency; (ii) an Audio Emotion Module (AEM) that extracts and transfers emotional cues from an emotion reference image to the generated video, enabling fine-grained and accurate emotion style control; (iii) a Face-Aware Audio Adapter (FAA) that isolates each audio-driven character with a latent-level face mask, enabling independent audio injection via cross-attention for multi-character scenarios. These innovations empower HunyuanVideo-Avatar to surpass state-of-the-art methods on benchmark datasets and a newly proposed in-the-wild dataset, generating realistic avatars in dynamic, immersive scenarios.
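As a rough illustration of the FAA idea, the sketch below gates an audio cross-attention update with a per-character face mask at the latent level, so each character responds only to its own audio stream. The class name, shapes, and the additive gating are assumptions for exposition, not the released implementation.

```python
import torch
import torch.nn as nn

class FaceAwareAudioInjection(nn.Module):
    """Illustrative sketch of the FAA idea: audio features are injected via
    cross-attention, but only latent tokens inside one character's face mask
    receive that character's audio. Names and shapes are assumptions."""

    def __init__(self, model_dim: int, audio_dim: int, num_heads: int = 4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, model_dim)
        self.cross_attn = nn.MultiheadAttention(model_dim, num_heads, batch_first=True)

    def forward(self, latent_tokens, audio_feats, face_mask):
        # latent_tokens: (B, N, D) video latent tokens
        # audio_feats:   (B, T, A) audio features for ONE character
        # face_mask:     (B, N) in {0, 1}, 1 where a token lies on that character's face
        audio = self.audio_proj(audio_feats)
        attn_out, _ = self.cross_attn(latent_tokens, audio, audio)
        # Gate the audio update with the face mask so other characters
        # and the background are untouched by this audio stream.
        return latent_tokens + face_mask.unsqueeze(-1) * attn_out

# Toy multi-character usage: apply one injection pass per (audio, mask) pair.
faa = FaceAwareAudioInjection(model_dim=128, audio_dim=32)
x = torch.randn(2, 256, 128)
speakers = [(torch.randn(2, 50, 32), torch.randint(0, 2, (2, 256)).float())]
for audio, mask in speakers:
    x = faa(x, audio, mask)
print(x.shape)  # torch.Size([2, 256, 128])
```

Applying the injection once per (audio, mask) pair is one simple way to handle multiple speakers; how the actual MM-DiT interleaves these updates inside its attention blocks may differ.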