🤖 AI Summary
This work addresses the limited behavioral diversity and shallow integration of language semantics in existing quadrupedal locomotion datasets, which hinder intuitive and agile human-robot interaction. To overcome this, we introduce QuadFM, the first large-scale, high-fidelity quadruped motion dataset, comprising 11,784 motion clips accompanied by 35,352 three-tier textual annotations. We further propose Gen2Control RL, a unified reinforcement learning framework that enables end-to-end, text-driven motion generation and control. Our approach uniquely integrates diverse locomotion patterns, expressive behaviors, and natural language instructions, achieving real-time inference in under 500 milliseconds on an NVIDIA Orin edge device. Both simulation and real-world experiments demonstrate the diversity, realism, and physical robustness of the generated motions.
📝 Abstract
Despite significant advances in quadrupedal robotics, a critical gap persists in foundational motion resources that holistically integrate diverse locomotion, emotionally expressive behaviors, and rich language semantics, all essential for agile, intuitive human-robot interaction. Current quadruped motion datasets are limited to a few mocap primitives (e.g., walk, trot, sit) and lack diverse behaviors with rich language grounding. To bridge this gap, we introduce Quadruped Foundational Motion (QuadFM), the first large-scale, ultra-high-fidelity dataset designed for text-to-motion generation and general motion control. QuadFM contains 11,784 curated motion clips spanning locomotion, interactive, and emotion-expressive behaviors (e.g., dancing, stretching, peeing), each with a three-layer annotation (fine-grained action labels, interaction scenarios, and natural language commands), totaling 35,352 descriptions to support language-conditioned understanding and command execution.
We further propose Gen2Control RL, a unified framework that jointly trains a general motion controller and a text-to-motion generator, enabling efficient end-to-end inference on edge hardware. On a real quadruped robot equipped with an NVIDIA Orin, our system achieves real-time motion synthesis with under 500 ms of latency. Simulation and real-world experiments demonstrate realistic, diverse motions with robust physical interaction. The dataset will be released at https://github.com/GaoLii/QuadFM.