X-MoGen: Unified Motion Generation across Humans and Animals

πŸ“… 2025-08-07
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing text-to-motion generation methods are largely confined to a single species; while cross-species modeling promises improved generalization, morphological disparities often lead to physically implausible motions. Method: We propose the first unified text-driven motion generation framework for both humans and animals. Our approach introduces UniMo4Dβ€”a large-scale, multi-species 4D motion datasetβ€”and designs a shared skeletal topology with joint representation learning. A morphology-consistency module ensures anatomically plausible cross-species motions. The two-stage architecture employs a conditional graph variational autoencoder to learn a T-pose prior in a shared latent space regularized by morphology-aware loss, followed by masked motion modeling to generate text-conditioned motion embeddings. Results: Extensive experiments demonstrate significant improvements over state-of-the-art methods on both seen and unseen species, substantially enhancing motion fidelity and cross-species generalization capability.

πŸ“ Abstract
Text-driven motion generation has attracted increasing attention due to its broad applications in virtual reality, animation, and robotics. While existing methods typically model human and animal motion separately, a joint cross-species approach offers key advantages, such as a unified representation and improved generalization. However, morphological differences across species remain a key challenge, often compromising motion plausibility. To address this, we propose **X-MoGen**, the first unified framework for cross-species text-driven motion generation covering both humans and animals. X-MoGen adopts a two-stage architecture. First, a conditional graph variational autoencoder learns canonical T-pose priors, while an autoencoder encodes motion into a shared latent space regularized by morphological loss. In the second stage, we perform masked motion modeling to generate motion embeddings conditioned on textual descriptions. During training, a morphological consistency module is employed to promote skeletal plausibility across species. To support unified modeling, we construct **UniMo4D**, a large-scale dataset of 115 species and 119k motion sequences, which integrates human and animal motions under a shared skeletal topology for joint training. Extensive experiments on UniMo4D demonstrate that X-MoGen outperforms state-of-the-art methods on both seen and unseen species.
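The abstract's second stage is "masked motion modeling": a BERT-style objective where random positions in a tokenized motion sequence are hidden and the model reconstructs them, conditioned on text. The paper does not spell out its masking scheme, so the following is only a generic sketch of the masking step itself; the function name, mask ratio, and the use of discrete integer tokens are all illustrative assumptions, and the text-conditioned predictor is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_motion_tokens(tokens, mask_id, mask_ratio=0.5, rng=rng):
    """BERT-style masking over a sequence of discrete motion tokens.

    A random subset of positions is replaced by `mask_id`; during
    training, a model would predict the original tokens at those
    positions, conditioned on a text embedding (not shown here).
    This is an illustrative sketch, not the paper's actual scheme.
    """
    tokens = np.asarray(tokens)
    n = len(tokens)
    n_mask = max(1, int(round(mask_ratio * n)))
    masked_pos = rng.choice(n, size=n_mask, replace=False)
    corrupted = tokens.copy()
    corrupted[masked_pos] = mask_id       # hide the selected positions
    return corrupted, np.sort(masked_pos)

# Toy "motion token" sequence of length 10; half its positions get masked.
tokens = np.arange(10)
corrupted, pos = mask_motion_tokens(tokens, mask_id=-1)
```

The masked positions and the corrupted sequence together form one training example: the model sees `corrupted` plus the text condition and is scored only on the entries listed in `pos`.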
Problem

Research questions and friction points this paper is trying to address.

Unified motion generation for humans and animals
Addressing morphological differences across species
Improving motion plausibility with cross-species modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage architecture with variational autoencoder
Morphological loss for shared latent space
Large-scale dataset with unified skeletal topology
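The "morphological loss" over the shared latent space is not specified beyond promoting skeletal plausibility. One common stand-in for such a regularizer is a bone-length consistency penalty: predicted poses should preserve the bone lengths of the species' canonical T-pose. The sketch below is a generic illustration under that assumption; the skeleton, edge list, and function names are invented for the example and are not from the paper.

```python
import numpy as np

# Toy 5-joint chain skeleton: each edge is a (parent, child) bone.
EDGES = [(0, 1), (1, 2), (2, 3), (3, 4)]

def bone_lengths(joints, edges=EDGES):
    """Per-bone lengths for joint positions of shape (J, 3)."""
    joints = np.asarray(joints, dtype=float)
    return np.array([np.linalg.norm(joints[c] - joints[p]) for p, c in edges])

def morphology_loss(pred_joints, tpose_joints, edges=EDGES):
    """Mean squared deviation of predicted bone lengths from the
    canonical T-pose bone lengths. A generic stand-in for the paper's
    morphology-aware regularizer, not its actual formulation."""
    d = bone_lengths(pred_joints, edges) - bone_lengths(tpose_joints, edges)
    return float(np.mean(d ** 2))

# A straight chain along x with unit-length bones as the canonical T-pose.
tpose = np.stack([np.array([i, 0.0, 0.0]) for i in range(5)])
loss_at_rest = morphology_loss(tpose, tpose)   # identical pose, zero loss
```

Because the penalty compares lengths rather than absolute positions, it is invariant to global rotation and translation of the pose, which is the property a cross-species skeletal regularizer typically needs.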
πŸ”Ž Similar Papers
No similar papers found.
Authors

Xuan Wang (Zhejiang University)
Kai Ruan (Gaoling School of Artificial Intelligence, Renmin University of China)
Liyang Qian (Zhejiang University)
Zhizhi Guo (Institute of Artificial Intelligence (TeleAI), China Telecom)
Chang Su (Zhejiang University)
Gaoang Wang (Zhejiang University / University of Illinois Urbana-Champaign Institute)