HunyuanVideo-Avatar: High-Fidelity Audio-Driven Human Animation for Multiple Characters

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses three key challenges in audio-driven human animation: preserving character identity consistency in highly dynamic video, aligning character emotion precisely with the audio, and animating multiple characters in the same scene. We propose HunyuanVideo-Avatar, a multimodal diffusion transformer (MM-DiT)-based model that generates dynamic, emotion-controllable, multi-character dialogue videos. Methodologically: (1) we introduce a character image injection module that replaces addition-based character conditioning, removing the condition mismatch between training and inference and keeping identity consistent across frames; (2) we design an Audio Emotion Module (AEM) that transfers emotional cues from an emotion reference image to the generated video for fine-grained emotion control; (3) we propose a Face-Aware Audio Adapter (FAA) that uses a latent-level face mask to isolate each audio-driven character, so that audio can be injected independently via cross-attention in multi-character scenes. Our method surpasses state-of-the-art methods on benchmark datasets and on a newly proposed in-the-wild dataset, generating realistic avatars in dynamic, immersive scenarios.

📝 Abstract
Recent years have witnessed significant progress in audio-driven human animation. However, critical challenges remain in (i) generating highly dynamic videos while preserving character consistency, (ii) achieving precise emotion alignment between characters and audio, and (iii) enabling multi-character audio-driven animation. To address these challenges, we propose HunyuanVideo-Avatar, a multimodal diffusion transformer (MM-DiT)-based model capable of simultaneously generating dynamic, emotion-controllable, and multi-character dialogue videos. Concretely, HunyuanVideo-Avatar introduces three key innovations: (i) A character image injection module is designed to replace the conventional addition-based character conditioning scheme, eliminating the inherent condition mismatch between training and inference. This ensures dynamic motion and strong character consistency; (ii) An Audio Emotion Module (AEM) is introduced to extract and transfer the emotional cues from an emotion reference image to the target generated video, enabling fine-grained and accurate emotion style control; (iii) A Face-Aware Audio Adapter (FAA) is proposed to isolate the audio-driven character with a latent-level face mask, enabling independent audio injection via cross-attention for multi-character scenarios. These innovations empower HunyuanVideo-Avatar to surpass state-of-the-art methods on benchmark datasets and a newly proposed wild dataset, generating realistic avatars in dynamic, immersive scenarios.
Problem

Research questions and friction points this paper is trying to address.

Generating dynamic videos with character consistency
Aligning character emotions precisely with audio
Enabling multi-character audio-driven animation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Character image injection module enhances consistency
Audio Emotion Module enables precise emotion control
Face-Aware Audio Adapter supports multi-character animation
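
The Face-Aware Audio Adapter idea (a latent-level face mask that restricts cross-attention audio injection to one character) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the single-head attention, the residual update, and the flat token layout are all assumptions made for illustration, and the page does not describe how the real model computes its masks or projections.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_audio_update(video_tokens, audio_tokens, face_mask, w_q, w_k, w_v):
    """Hypothetical single-head cross-attention from video to audio tokens.

    video_tokens: (N, D) latent video tokens (queries)
    audio_tokens: (M, D) audio features for ONE character (keys/values)
    face_mask:    (N,) values in [0, 1], 1 where a token belongs to
                  that character's face region (assumed to be given)
    Returns the masked update, shape (N, D).
    """
    q = video_tokens @ w_q
    k = audio_tokens @ w_k
    v = audio_tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (N, M)
    # The mask zeroes the audio update outside the character's face.
    return face_mask[:, None] * (attn @ v)

def inject_multi_character(video_tokens, audios, masks, w_q, w_k, w_v):
    # Each character's audio only touches tokens inside its own mask.
    out = video_tokens.copy()
    for audio, mask in zip(audios, masks):
        out += masked_audio_update(video_tokens, audio, mask, w_q, w_k, w_v)
    return out

rng = np.random.default_rng(0)
N, M, D = 6, 4, 8
video = rng.normal(size=(N, D))
audio_a, audio_b = rng.normal(size=(M, D)), rng.normal(size=(M, D))
mask_a = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0])  # character A's face tokens
mask_b = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 0.0])  # character B's face tokens
w_q, w_k, w_v = (rng.normal(size=(D, D)) for _ in range(3))
out = inject_multi_character(video, [audio_a, audio_b], [mask_a, mask_b], w_q, w_k, w_v)
```

With disjoint masks, tokens outside both face regions are left unchanged, and swapping character B's audio does not alter character A's tokens, which is the independence property the adapter is described as providing.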
🔎 Similar Papers