🤖 AI Summary
MotionGlot addresses the core challenges of cross-embodiment motion generation—mismatched action-space dimensions across heterogeneous agents (e.g., quadrupeds and humans), scarcity of high-quality annotated data, and difficulty of text-motion alignment—by adapting large language model (LLM) training paradigms to motion synthesis. To this end, it introduces: (1) a mechanism for aligning the motion spaces of multiple embodiments; (2) a joint text-motion representation trained with instruction fine-tuning; and (3) two new datasets, the first directionally annotated quadruped locomotion dataset and a large corpus of situational text prompts for human motion generation. Evaluated on six generation tasks, MotionGlot improves over prior methods by 35.3% on average and is deployed end-to-end on a physical quadruped robot. This work establishes a paradigm for text-driven, general-purpose motion generation across diverse embodied agents.
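One way to picture the alignment of heterogeneous motion spaces is to map each embodiment's discrete motion codebook into a disjoint slice of a single shared token vocabulary, so one autoregressive model can emit motion tokens for either agent. The sketch below illustrates this idea; the vocabulary size, codebook sizes, and function names are illustrative assumptions, not MotionGlot's actual implementation.

```python
# Hypothetical sketch: mapping per-embodiment motion codebooks into one
# shared token vocabulary so a single autoregressive model can generate
# motions for agents with different action dimensions. All names and
# sizes below are illustrative assumptions, not the paper's API.

TEXT_VOCAB_SIZE = 50_257          # e.g., a GPT-2-style text tokenizer (assumed)
CODEBOOK_SIZES = {                # one discrete motion codebook per embodiment
    "human": 512,                 # assumed codebook sizes (illustrative)
    "quadruped": 256,
}

def build_offsets(text_vocab_size: int, codebook_sizes: dict[str, int]) -> dict[str, int]:
    """Assign each embodiment a disjoint token-ID range after the text tokens."""
    offsets, cursor = {}, text_vocab_size
    for name, size in codebook_sizes.items():
        offsets[name] = cursor
        cursor += size
    return offsets

OFFSETS = build_offsets(TEXT_VOCAB_SIZE, CODEBOOK_SIZES)

def motion_to_tokens(embodiment: str, code_indices: list[int]) -> list[int]:
    """Shift raw codebook indices into that embodiment's global ID range."""
    offset = OFFSETS[embodiment]
    assert all(0 <= i < CODEBOOK_SIZES[embodiment] for i in code_indices)
    return [offset + i for i in code_indices]

# A human clip and a quadruped clip now live in one shared vocabulary:
print(motion_to_tokens("human", [0, 17, 255]))      # IDs in the human range
print(motion_to_tokens("quadruped", [0, 17, 255]))  # IDs in the quadruped range
```

Keeping the ranges disjoint means the model's output distribution can be masked per embodiment at inference time, which is one common design choice for multi-domain token vocabularies.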
📝 Abstract
This paper introduces MotionGlot, a model that can generate motion across multiple embodiments with different action dimensions, such as quadruped robots and human bodies. By leveraging the well-established training procedures commonly used in large language models (LLMs), we introduce an instruction-tuning template specifically designed for motion-related tasks. Our approach demonstrates that the principles underlying LLM training can be successfully adapted to learn a wide range of motion generation tasks across multiple embodiments with different action dimensions. We demonstrate the abilities of MotionGlot on a set of six tasks and report an average improvement of 35.3% across them. Additionally, we contribute two new datasets: (1) a dataset of expert-controlled quadruped locomotion comprising approximately 48,000 trajectories paired with direction-based text annotations, and (2) a dataset of over 23,000 situational text prompts for human motion generation tasks. Finally, we conduct hardware experiments to validate the capabilities of our system in real-world applications.
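To make the instruction-tuning idea concrete, the following minimal sketch shows one plausible way to render a (task instruction, text input, motion-token target) triple as a training string, assuming motions have already been discretized into tokens such as `<motion_42>`. The field labels and wording are our assumptions; the paper's exact template may differ.

```python
# Minimal sketch of an LLM-style instruction template adapted to motion
# tasks. The "### Instruction / Input / Response" layout is a common
# instruction-tuning convention, assumed here for illustration only.

def format_example(instruction: str, inp: str, output_tokens: list[str]) -> str:
    """Render one (instruction, input, target-motion) triple as a training string."""
    return (
        f"### Instruction:\n{instruction}\n"
        f"### Input:\n{inp}\n"
        f"### Response:\n{' '.join(output_tokens)}"
    )

example = format_example(
    instruction="Generate a motion that matches the text description.",
    inp="a person walks forward and turns left",
    output_tokens=[f"<motion_{i}>" for i in (42, 7, 311, 98)],  # hypothetical motion tokens
)
print(example)
```

Under this framing, swapping the instruction line is enough to cover different tasks (e.g., text-to-motion for a human versus direction-conditioned locomotion for a quadruped) within one training pipeline.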