JoyAvatar: Unlocking Highly Expressive Avatars via Harmonized Text-Audio Conditioning

📅 2026-01-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video-based digital human models struggle to simultaneously achieve high expressiveness and multimodal alignment when handling complex textual instructions, such as those describing full-body motion, dynamic camera movement, background transitions, or human-object interactions. To address this challenge, this work proposes JoyAvatar, a framework that transfers text-driven control from foundation models via dual-teacher distillation and introduces a denoising-timestep-aware modulation mechanism that dynamically weights the multimodal conditions. The approach jointly optimizes textual controllability, audio-visual synchronization, and motion naturalness, mitigating conflicts among heterogeneous conditioning signals. In GSB evaluations, JoyAvatar outperforms state-of-the-art models including Omnihuman-1.5 and KlingAvatar 2.0, and remains robust in complex scenarios such as multi-character dialogue and non-human avatars while preserving accurate lip-sync and identity consistency.
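As a rough picture of the dual-teacher idea, the sketch below distills a student denoiser against two frozen teachers. Everything here is an assumption for illustration: the paper does not publish its training objective, so the `twin_teacher_loss` function, the MSE matching terms, and the equal `lambda` weights are hypothetical stand-ins for however JoyAvatar actually combines supervision from a text-to-video foundation teacher and an audio-driven teacher.

```python
import torch
import torch.nn.functional as F

def twin_teacher_loss(student_out: torch.Tensor,
                      text_teacher_out: torch.Tensor,
                      audio_teacher_out: torch.Tensor,
                      lambda_text: float = 0.5,
                      lambda_audio: float = 0.5) -> torch.Tensor:
    """Hypothetical twin-teacher distillation objective.

    The student's denoising prediction is pulled toward two frozen
    teachers: a text-to-video foundation model (source of text
    controllability) and an audio-driven avatar model (source of
    audio-visual synchronization). The MSE matching and the 0.5/0.5
    weighting are illustrative assumptions, not the paper's recipe.
    """
    loss_text = F.mse_loss(student_out, text_teacher_out.detach())
    loss_audio = F.mse_loss(student_out, audio_teacher_out.detach())
    return lambda_text * loss_text + lambda_audio * loss_audio
```

In a real training loop the two teachers would be fed the student's noisy input and timestep, so that their predictions are comparable regression targets.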

📝 Abstract
Existing video avatar models have demonstrated impressive capabilities in scenarios such as talking, public speaking, and singing. However, the majority of these methods exhibit limited alignment with text instructions, particularly when prompts involve complex elements such as large full-body movement, dynamic camera trajectories, background transitions, or human-object interactions. To overcome this limitation, we present JoyAvatar, a framework capable of generating long-duration avatar videos, featuring two key technical innovations. First, we introduce a twin-teacher enhanced training algorithm that enables the model to transfer the inherent text controllability of the foundation model while simultaneously learning audio-visual synchronization. Second, during training we dynamically modulate the strength of the multimodal conditions (e.g., audio and text) according to the denoising timestep, mitigating conflicts between the heterogeneous conditioning signals. Together, these designs substantially expand the avatar model's capacity to generate natural, temporally coherent full-body motion and dynamic camera movement while preserving basic avatar capabilities such as accurate lip-sync and identity consistency. GSB evaluation results demonstrate that JoyAvatar outperforms state-of-the-art models such as Omnihuman-1.5 and KlingAvatar 2.0. Moreover, our approach enables complex applications including multi-person dialogue and role-playing with non-human subjects. Video samples are provided at https://joyavatar.github.io/.
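To make the timestep-dependent conditioning concrete, here is a minimal sketch. The `TimestepAwareModulator` class and its linear text-early / audio-late schedule are invented for illustration; the abstract only states that condition strength is modulated per denoising timestep, not how.

```python
import torch
import torch.nn as nn

class TimestepAwareModulator(nn.Module):
    """Scales text and audio condition embeddings by denoising timestep.

    The schedule is an assumption: text guidance is weighted most at
    high-noise steps (where coarse layout, camera, and body motion are
    decided) and audio guidance most at low-noise steps (where fine lip
    detail is refined). Assumes both embeddings have the same rank.
    """

    def __init__(self, num_timesteps: int = 1000):
        super().__init__()
        self.num_timesteps = num_timesteps

    def forward(self, text_emb: torch.Tensor, audio_emb: torch.Tensor,
                t: torch.Tensor):
        # Normalize t to [0, 1]; 1.0 corresponds to the noisiest step.
        s = t.float() / self.num_timesteps
        # Reshape for broadcasting over the embedding dimensions.
        s = s.view(-1, *([1] * (text_emb.dim() - 1)))
        return s * text_emb, (1.0 - s) * audio_emb


# Usage: one sample at an early (noisy) step, one at a late step.
mod = TimestepAwareModulator(num_timesteps=1000)
text = torch.randn(2, 16, 768)   # (batch, text tokens, dim)
audio = torch.randn(2, 32, 768)  # (batch, audio frames, dim)
t = torch.tensor([900, 100])     # per-sample denoising timesteps
text_weighted, audio_weighted = mod(text, audio, t)
```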
Problem

Research questions and friction points this paper is trying to address.

expressive avatars
text-audio conditioning
full-body motion
dynamic camera trajectory
human-object interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

twin-teacher training
dynamic multimodal conditioning
text-audio harmonization
full-body avatar animation
denoising timestep modulation
Ruikui Wang
JD Technology
Jinheng Feng
JD Technology
Lang Tian
JD Technology
Huaishao Luo
JD Technology
Chaochao Li
JD Technology
Liangbo Zhou
JD Technology
Huan Zhang
Unknown affiliation
Youzheng Wu
JD AI Research, JD.COM
Natural Language Processing, Dialogue, Speech Recognition, Deep Learning
Xiaodong He
AI Lab, JD.com; IEEE/CAAI Fellow
natural language processing, multimodal vision-and-language, deep learning