Versatile Multimodal Controls for Whole-Body Talking Human Animation

📅 2025-03-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods struggle to generate whole-body talking animations driven jointly by audio and text from a single portrait image. This paper proposes the first text-controllable, audio-driven whole-body motion generation framework. First, it introduces a code-pose translation module that links VAE codebook entries to 2D poses to enhance motion naturalness. Second, it designs a multimodal video diffusion model that fuses Whisper audio embeddings, CLIP text encodings, and 2D DWpose-guided motion representations for high-fidelity animation synthesis. The method accepts either headshot or whole-body portrait inputs and significantly improves visual quality, identity preservation, and lip-sync accuracy. Notably, it enables fine-grained, text-specified whole-body gestures such as “waving” or “nodding”. Quantitative and qualitative evaluations demonstrate state-of-the-art performance on multiple metrics, including FID and SyncNet-based lip-sync scores (LSE).
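
The summary above describes fusing several conditioning streams into one diffusion backbone. Below is a minimal sketch of one plausible reading, assuming precomputed Whisper audio embeddings, CLIP text encodings, and pose features are injected via cross-attention. The class name, projection layout, and all dimensions are our own assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn as nn

class MultimodalConditioner(nn.Module):
    """Hypothetical fusion block: video tokens attend to audio/text/pose tokens."""

    def __init__(self, audio_dim=768, text_dim=512, pose_dim=256, hidden=1024):
        super().__init__()
        # Project each modality into a shared conditioning space.
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.text_proj = nn.Linear(text_dim, hidden)
        self.pose_proj = nn.Linear(pose_dim, hidden)
        self.attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)

    def forward(self, video_tokens, audio_emb, text_emb, pose_emb):
        # Concatenate per-modality tokens into one conditioning sequence.
        cond = torch.cat([
            self.audio_proj(audio_emb),
            self.text_proj(text_emb),
            self.pose_proj(pose_emb),
        ], dim=1)
        # Video tokens query the fused multimodal conditioning.
        fused, _ = self.attn(video_tokens, cond, cond)
        return video_tokens + fused  # residual conditioning

# Toy shapes: batch of 2, 16 video tokens; 50 audio frames,
# 77 text tokens, 16 pose frames (all widths illustrative).
block = MultimodalConditioner()
out = block(
    torch.randn(2, 16, 1024),
    torch.randn(2, 50, 768),
    torch.randn(2, 77, 512),
    torch.randn(2, 16, 256),
)
print(out.shape)  # torch.Size([2, 16, 1024])
```
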

📝 Abstract
Human animation from a single reference image should flexibly synthesize whole-body motion for either a headshot or a whole-body portrait, with motions readily controlled by audio signals and text prompts. This is difficult for most existing methods, as they only support producing pre-specified head or half-body motion aligned with audio inputs. In this paper, we propose a versatile human animation method, VersaAnimator, which generates whole-body talking humans from arbitrary portrait images, driven not only by audio signals but also flexibly controlled by text prompts. Specifically, we design a text-controlled, audio-driven motion generator that produces whole-body motion representations in 3D, synchronized with audio inputs while following textual motion descriptions. To promote natural, smooth motion, we propose a code-pose translation module that links VAE codebooks with 2D DWposes extracted from template videos. Moreover, we introduce a multi-modal video diffusion model that generates photorealistic human animation from a reference image according to both audio inputs and whole-body motion representations. Extensive experiments show that VersaAnimator outperforms existing methods in visual quality, identity preservation, and audio-lip synchronization.
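
To make the code-pose translation idea concrete, here is a minimal sketch of one plausible reading: discrete VAE codebook indices are decoded into smooth 2D DWpose-style keypoint sequences. The codebook size, code width, GRU smoother, and the 133-keypoint whole-body layout are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class CodePoseTranslator(nn.Module):
    """Hypothetical decoder from discrete motion codes to 2D keypoint tracks."""

    def __init__(self, codebook_size=1024, code_dim=256, num_keypoints=133):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, code_dim)
        # Temporal smoothing over the code sequence before pose regression,
        # matching the stated goal of natural, smooth motion (our assumption).
        self.temporal = nn.GRU(code_dim, code_dim, batch_first=True)
        self.to_pose = nn.Linear(code_dim, num_keypoints * 2)  # (x, y) per joint

    def forward(self, code_indices):
        # code_indices: (batch, frames) integer codebook ids
        codes = self.codebook(code_indices)
        smoothed, _ = self.temporal(codes)
        b, t, _ = smoothed.shape
        return self.to_pose(smoothed).view(b, t, -1, 2)  # (batch, frames, joints, 2)

poses = CodePoseTranslator()(torch.randint(0, 1024, (2, 24)))
print(poses.shape)  # torch.Size([2, 24, 133, 2])
```
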
Problem

Research questions and friction points this paper is trying to address.

Generating whole-body human animation from a single reference image
Controlling motion jointly with audio signals and text prompts
Improving visual quality and audio-lip synchronization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-controlled, audio-driven motion generator
Code-pose translation module for smooth motion
Multi-modal video diffusion model for photorealistic animation (see the pipeline sketch after this list)
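
To show how the three contributions above could compose at inference time, here is a hedged end-to-end sketch. The paper does not publish this interface; `animate`, the stub modules, and all tensor shapes are placeholders invented for illustration.

```python
import torch

# Stand-in stubs for the three components; real interfaces are not published.
motion_generator = lambda audio, text: torch.randint(0, 1024, (audio.shape[0], 24))
code_pose_translator = lambda codes: torch.randn(codes.shape[0], codes.shape[1], 133, 2)
video_diffusion = lambda ref, audio, poses: torch.randn(ref.shape[0], poses.shape[1], 3, 256, 256)

def animate(reference_image, audio_emb, text_emb):
    codes = motion_generator(audio_emb, text_emb)   # 1. audio + text -> motion codes
    pose_seq = code_pose_translator(codes)          # 2. codes -> smooth 2D pose tracks
    return video_diffusion(reference_image, audio_emb, pose_seq)  # 3. -> video frames

frames = animate(
    torch.randn(1, 3, 256, 256),   # single reference portrait
    torch.randn(1, 50, 768),       # audio embedding sequence
    torch.randn(1, 77, 512),       # text prompt encoding
)
print(frames.shape)  # torch.Size([1, 24, 3, 256, 256])
```
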
🔎 Similar Papers
No similar papers found.
👥 Authors
Zheng Qin (Xi’an Jiaotong University)
Ruobing Zheng (Ant Group)
Yabing Wang (Xi’an Jiaotong University)
Tianqi Li (Ant Group)
Zixin Zhu (University at Buffalo)
Minghui Yang (Ant Group)
Ming Yang (Ant Group)
Le Wang (Xi’an Jiaotong University)