JoyAvatar: Unlocking Highly Expressive Avatars via Harmonized Text-Audio Conditioning

📅 2026-01-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video-based digital human models struggle to simultaneously achieve high expressiveness and multimodal alignment when handling complex textual instructions, such as those describing full-body motion, dynamic camera movement, background transitions, or human-object interactions. To address this challenge, this work proposes JoyAvatar, a framework that transfers text-driven control from foundation models via dual-teacher distillation and introduces a denoising-timestep-aware modulation mechanism that dynamically weights the multimodal conditions. The approach jointly optimizes textual controllability, audio-visual synchronization, and motion naturalness, mitigating conflicts among heterogeneous conditioning signals. In GSB evaluations, JoyAvatar outperforms state-of-the-art models including Omnihuman-1.5 and KlingAvatar 2.0, and remains robust in complex scenarios such as multi-character dialogue and non-human avatars while preserving accurate lip-sync and identity consistency.
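As a rough picture of the dual-teacher idea, the sketch below distills a student denoiser against two frozen teachers. Everything here is an assumption for illustration: the paper does not publish its training objective, so the `twin_teacher_loss` function, the MSE matching terms, and the equal `lambda` weights are hypothetical stand-ins for however JoyAvatar actually combines supervision from a text-to-video foundation teacher and an audio-driven teacher.

```python
import torch
import torch.nn.functional as F

def twin_teacher_loss(student_out: torch.Tensor,
                      text_teacher_out: torch.Tensor,
                      audio_teacher_out: torch.Tensor,
                      lambda_text: float = 0.5,
                      lambda_audio: float = 0.5) -> torch.Tensor:
    """Hypothetical twin-teacher distillation objective.

    The student's denoising prediction is pulled toward two frozen
    teachers: a text-to-video foundation model (source of text
    controllability) and an audio-driven avatar model (source of
    audio-visual synchronization). The MSE matching and the 0.5/0.5
    weighting are illustrative assumptions, not the paper's recipe.
    """
    loss_text = F.mse_loss(student_out, text_teacher_out.detach())
    loss_audio = F.mse_loss(student_out, audio_teacher_out.detach())
    return lambda_text * loss_text + lambda_audio * loss_audio
```

In a real training loop the two teachers would be fed the student's noisy input and timestep, so that their predictions are comparable regression targets.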

📝 Abstract
Existing video avatar models have demonstrated impressive capabilities in scenarios such as talking, public speaking, and singing. However, the majority of these methods exhibit limited alignment with text instructions, particularly when prompts involve complex elements such as large full-body movement, dynamic camera trajectories, background transitions, or human-object interactions. To overcome this limitation, we present JoyAvatar, a framework capable of generating long-duration avatar videos, featuring two key technical innovations. First, we introduce a twin-teacher enhanced training algorithm that enables the model to transfer the inherent text controllability of the foundation model while simultaneously learning audio-visual synchronization. Second, during training we dynamically modulate the strength of the multimodal conditions (e.g., audio and text) according to the denoising timestep, mitigating conflicts between the heterogeneous conditioning signals. Together, these designs substantially expand the avatar model's capacity to generate natural, temporally coherent full-body motion and dynamic camera movement while preserving basic avatar capabilities such as accurate lip-sync and identity consistency. GSB evaluation results demonstrate that JoyAvatar outperforms state-of-the-art models such as Omnihuman-1.5 and KlingAvatar 2.0. Moreover, our approach enables complex applications including multi-person dialogue and role-playing with non-human subjects. Video samples are provided at https://joyavatar.github.io/.
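To make the timestep-dependent conditioning concrete, here is a minimal sketch. The `TimestepAwareModulator` class and its linear text-early / audio-late schedule are invented for illustration; the abstract only states that condition strength is modulated per denoising timestep, not how.

```python
import torch
import torch.nn as nn

class TimestepAwareModulator(nn.Module):
    """Scales text and audio condition embeddings by denoising timestep.

    The schedule is an assumption: text guidance is weighted most at
    high-noise steps (where coarse layout, camera, and body motion are
    decided) and audio guidance most at low-noise steps (where fine lip
    detail is refined). Assumes both embeddings have the same rank.
    """

    def __init__(self, num_timesteps: int = 1000):
        super().__init__()
        self.num_timesteps = num_timesteps

    def forward(self, text_emb: torch.Tensor, audio_emb: torch.Tensor,
                t: torch.Tensor):
        # Normalize t to [0, 1]; 1.0 corresponds to the noisiest step.
        s = t.float() / self.num_timesteps
        # Reshape for broadcasting over the embedding dimensions.
        s = s.view(-1, *([1] * (text_emb.dim() - 1)))
        return s * text_emb, (1.0 - s) * audio_emb


# Usage: one sample at an early (noisy) step, one at a late step.
mod = TimestepAwareModulator(num_timesteps=1000)
text = torch.randn(2, 16, 768)   # (batch, text tokens, dim)
audio = torch.randn(2, 32, 768)  # (batch, audio frames, dim)
t = torch.tensor([900, 100])     # per-sample denoising timesteps
text_weighted, audio_weighted = mod(text, audio, t)
```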
Problem

Research questions and friction points this paper is trying to address.

expressive avatars
text-audio conditioning
full-body motion
dynamic camera trajectory
human-object interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

twin-teacher training
dynamic multimodal conditioning
text-audio harmonization
full-body avatar animation
denoising timestep modulation
Ruikui Wang
JD Technology
Jinheng Feng
JD Technology
Lang Tian
JD Technology
Huaishao Luo
JD Technology
Chaochao Li
JD Technology
Liangbo Zhou
JD Technology
Huan Zhang
Unknown affiliation
Youzheng Wu
JD AI Research, JD.COM
Natural Language Processing, Dialogue, Speech Recognition, Deep Learning
Xiaodong He
AI Lab, JD.com; IEEE/CAAI Fellow
natural language processing, multimodal vision-and-language, deep learning