🤖 AI Summary
Existing text-to-motion generation methods are largely confined to a single species; while cross-species modeling promises improved generalization, morphological disparities often lead to physically implausible motions.
Method: We propose the first unified text-driven motion generation framework covering both humans and animals. Our approach introduces UniMo4D, a large-scale multi-species 4D motion dataset, and designs a shared skeletal topology for joint representation learning, with a morphology-consistency module that keeps cross-species motions anatomically plausible. The two-stage architecture employs a conditional graph variational autoencoder to learn a canonical T-pose prior, alongside an autoencoder that embeds motion in a shared latent space regularized by a morphology-aware loss; masked motion modeling then generates motion embeddings conditioned on the text.
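To make the first stage concrete, here is a minimal sketch of the idea, assuming a PyTorch-style implementation: a conditional graph VAE over per-joint T-pose coordinates on a shared skeletal topology, with a bone-length term standing in for the morphology-aware regularizer. The paper does not publish this code; all module names, layer sizes, and the species-embedding conditioning are illustrative assumptions.

```python
# Hypothetical Stage-1 sketch (not the authors' code): a conditional graph
# VAE over per-joint T-pose coordinates on a shared skeletal topology.
# Species conditioning, dimensions, and the bone-length regularizer are
# illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphLayer(nn.Module):
    """One graph step: mix features along skeleton edges, then project."""
    def __init__(self, adj, d_in, d_out):
        super().__init__()
        self.register_buffer("adj", adj)            # (J, J) normalized adjacency
        self.lin = nn.Linear(d_in, d_out)

    def forward(self, x):                            # x: (B, J, d_in)
        return F.relu(self.lin(self.adj @ x))

class CondGraphVAE(nn.Module):
    """Encodes a T-pose into a latent prior, conditioned on the species."""
    def __init__(self, adj, n_joints, n_species, d_hid=64, d_z=32):
        super().__init__()
        self.species_emb = nn.Embedding(n_species, d_hid)
        self.enc = GraphLayer(adj, 3 + d_hid, d_hid)
        self.to_mu = nn.Linear(n_joints * d_hid, d_z)
        self.to_logvar = nn.Linear(n_joints * d_hid, d_z)
        self.dec = nn.Sequential(
            nn.Linear(d_z + d_hid, n_joints * d_hid), nn.ReLU(),
            nn.Linear(n_joints * d_hid, n_joints * 3),
        )

    def forward(self, tpose, species):               # tpose: (B, J, 3)
        B, J, _ = tpose.shape
        c = self.species_emb(species)                # (B, d_hid)
        h = self.enc(torch.cat([tpose, c[:, None].expand(B, J, -1)], -1))
        mu, logvar = self.to_mu(h.flatten(1)), self.to_logvar(h.flatten(1))
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        recon = self.dec(torch.cat([z, c], -1)).view(B, J, 3)
        return recon, mu, logvar

def bone_length_loss(pred, target, edges):
    """Morphology-style regularizer: match bone lengths along skeleton edges."""
    i, j = edges[:, 0], edges[:, 1]
    return F.mse_loss((pred[:, i] - pred[:, j]).norm(dim=-1),
                      (target[:, i] - target[:, j]).norm(dim=-1))
```

Training such a model would combine reconstruction, the usual KL term on (mu, logvar), and the bone-length penalty, so that decoded skeletons stay anatomically plausible across species.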
Results: Extensive experiments show that the framework outperforms state-of-the-art methods on both seen and unseen species, improving motion fidelity and cross-species generalization.
📄 Abstract
Text-driven motion generation has attracted increasing attention due to its broad applications in virtual reality, animation, and robotics. While existing methods typically model human and animal motion separately, a joint cross-species approach offers key advantages, such as a unified representation and improved generalization. However, morphological differences across species remain a key challenge, often compromising motion plausibility. To address this, we propose X-MoGen, the first unified framework for cross-species text-driven motion generation covering both humans and animals. X-MoGen adopts a two-stage architecture. First, a conditional graph variational autoencoder learns canonical T-pose priors, while an autoencoder encodes motion into a shared latent space regularized by a morphological loss. In the second stage, we perform masked motion modeling to generate motion embeddings conditioned on textual descriptions. During training, a morphological consistency module is employed to promote skeletal plausibility across species. To support unified modeling, we construct UniMo4D, a large-scale dataset of 115 species and 119k motion sequences, which integrates human and animal motions under a shared skeletal topology for joint training. Extensive experiments on UniMo4D demonstrate that X-MoGen outperforms state-of-the-art methods on both seen and unseen species.
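As a rough illustration of the masked motion modeling step described above (a sketch under assumptions, not the authors' implementation), the snippet below masks a fraction of latent motion frames and asks a transformer to reconstruct them conditioned on a text embedding prepended as a prefix token. The feature sizes, masking ratio, and the CLIP-like 512-d text vector are hypothetical.

```python
# Hypothetical Stage-2 sketch (not the paper's implementation): masked
# motion modeling with a transformer. Masked latent frames are replaced
# by a learned [MASK] embedding and reconstructed conditioned on a text
# feature prepended as a prefix token. All dimensions are illustrative.
import torch
import torch.nn as nn

class MaskedMotionModel(nn.Module):
    def __init__(self, d_motion=32, d_model=128, n_layers=4, n_heads=4):
        super().__init__()
        self.in_proj = nn.Linear(d_motion, d_model)
        self.text_proj = nn.Linear(512, d_model)   # assumes a 512-d text feature
        self.mask_tok = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.out_proj = nn.Linear(d_model, d_motion)

    def forward(self, motion_z, text_feat, mask_ratio=0.5):
        # motion_z: (B, T, d_motion) latent frames; text_feat: (B, 512)
        x = self.in_proj(motion_z)
        masked = torch.rand(x.shape[:2], device=x.device) < mask_ratio
        x = torch.where(masked[..., None], self.mask_tok.expand_as(x), x)
        x = torch.cat([self.text_proj(text_feat)[:, None], x], dim=1)
        pred = self.out_proj(self.backbone(x))[:, 1:]   # drop the text token
        return pred, masked

model = MaskedMotionModel()
z = torch.randn(2, 60, 32)                  # 60 latent frames per clip
txt = torch.randn(2, 512)                   # placeholder text embedding
pred, masked = model(z, txt)
loss = ((pred - z) ** 2)[masked].mean()     # reconstruct only masked frames
```

Scoring the loss only on masked frames, as in masked-language-model pretraining, lets generation proceed at inference time by iteratively unmasking frames conditioned on the text.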