🤖 AI Summary
Existing motion generation methods are constrained by the scarcity of heterogeneous animal motion data and by modeling bottlenecks arising from fixed skeletal templates. To address these limitations, we propose the first text-driven, topology-agnostic animal motion generation framework. Our approach comprises three key components: (1) constructing OmniZoo, a large-scale, multi-species motion dataset encompassing 140 animal species and 32,979 motion sequences; (2) designing a topology-aware skeletal embedding module that jointly encodes arbitrary skeletal geometries and textual semantics into a unified representation space; and (3) integrating autoregressive sequence modeling, multimodal alignment, and joint representation learning. The resulting method generates physically plausible, temporally coherent, and semantically accurate motions. It further enables cross-species motion style transfer and demonstrates strong generalization to unseen skeletal topologies.
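To make the topology-aware skeletal embedding concrete, below is a minimal PyTorch sketch of one way such a module could work. It is an illustrative assumption, not the paper's actual implementation: the class name, feature choices (parent-relative offsets plus tree depth), and dimensions are all hypothetical. The idea is that per-joint geometric and structural features are projected into the same token width as the text encoder, so skeleton tokens and text tokens can be fused in one sequence.

```python
# Hypothetical sketch of a topology-aware skeleton embedding; not the
# paper's implementation. Each joint is described by its rest-pose offset
# from its parent plus its depth in the kinematic tree, then projected
# into a shared token space and mixed with self-attention, which works
# for any joint count or topology.
import torch
import torch.nn as nn


class TopologySkeletonEmbedding(nn.Module):
    def __init__(self, token_dim: int = 256, max_depth: int = 32):
        super().__init__()
        self.offset_proj = nn.Linear(3, token_dim)             # geometric feature
        self.depth_embed = nn.Embedding(max_depth, token_dim)  # structural feature
        layer = nn.TransformerEncoderLayer(d_model=token_dim, nhead=4,
                                           batch_first=True)
        self.mixer = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, offsets: torch.Tensor, parents: list[int]) -> torch.Tensor:
        # offsets: (J, 3) parent-relative rest-pose offsets.
        # parents: parents[j] is the parent index of joint j (-1 for the
        # root); joints are assumed to be listed in topological order.
        depths = torch.zeros(len(parents), dtype=torch.long)
        for j, p in enumerate(parents):
            depths[j] = 0 if p < 0 else depths[p] + 1
        tokens = self.offset_proj(offsets) + self.depth_embed(depths)
        return self.mixer(tokens.unsqueeze(0))  # (1, J, token_dim)


# Usage on a toy 4-joint chain; the output tokens share the text encoder's
# width, so they can simply be concatenated with text tokens downstream.
embed = TopologySkeletonEmbedding()
skeleton_tokens = embed(torch.randn(4, 3), parents=[-1, 0, 1, 2])
print(skeleton_tokens.shape)  # torch.Size([1, 4, 256])
```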
📝 Abstract
Motion generation is fundamental to computer animation and widely used across entertainment, robotics, and virtual environments. While recent methods achieve impressive results, most rely on fixed skeletal templates, which prevent them from generalizing to skeletons with different or perturbed topologies. We address the core limitation of current motion generation methods: the combined lack of large-scale heterogeneous animal motion data and of unified generative frameworks capable of jointly modeling arbitrary skeletal topologies and textual conditions. To this end, we introduce OmniZoo, a large-scale animal motion dataset spanning 140 species and 32,979 sequences, enriched with multimodal annotations. Building on OmniZoo, we propose a generalized autoregressive motion generation framework capable of producing text-driven motions for arbitrary skeletal topologies. Central to our model is a Topology-aware Skeleton Embedding Module that encodes the geometric and structural properties of any skeleton into a shared token space, enabling seamless fusion with textual semantics. Given a text prompt and a target skeleton, our method generates temporally coherent, physically plausible, and semantically aligned motions, and further enables cross-species motion style transfer.
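For intuition on how an autoregressive framework could consume such a fused prefix, here is a second minimal sketch, again under stated assumptions: we assume motions are discretized into a token vocabulary (e.g., by a VQ-style tokenizer) and decoded greedily; the model name, vocabulary size, and sampling scheme are hypothetical, not the paper's specification.

```python
# Hypothetical decoder-only sketch: text + skeleton tokens form a prefix,
# and discrete motion tokens are generated one step at a time under a
# causal mask. Names and sizes are assumptions for illustration.
import torch
import torch.nn as nn


class MotionARDecoder(nn.Module):
    def __init__(self, vocab: int = 512, dim: int = 256, max_len: int = 256):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)    # motion-token embeddings
        self.pos = nn.Embedding(max_len, dim)  # learned positions
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, vocab)

    def forward(self, prefix: torch.Tensor, motion_ids: torch.Tensor) -> torch.Tensor:
        # prefix: (B, P, dim) fused text + skeleton tokens.
        # motion_ids: (B, T) motion-token ids generated so far.
        x = torch.cat([prefix, self.tok(motion_ids)], dim=1)
        x = x + self.pos(torch.arange(x.size(1), device=x.device))
        mask = nn.Transformer.generate_square_subsequent_mask(
            x.size(1)).to(x.device)
        h = self.blocks(x, mask=mask)             # causal self-attention
        return self.head(h[:, prefix.size(1):])   # logits at motion positions


@torch.no_grad()
def generate(model, prefix, steps=16, bos_id=0):
    # Greedy decoding for brevity; sampling (top-k / nucleus) is the more
    # common choice for diverse motion synthesis.
    ids = torch.full((prefix.size(0), 1), bos_id,
                     dtype=torch.long, device=prefix.device)
    for _ in range(steps):
        logits = model(prefix, ids)[:, -1]
        ids = torch.cat([ids, logits.argmax(-1, keepdim=True)], dim=1)
    return ids[:, 1:]  # the generated motion-token sequence
```

A decoded token sequence would then be mapped back to per-joint poses for the target skeleton; because the prefix carries the skeleton tokens, the same decoder can in principle serve any topology.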