QuadFM: Foundational Text-Driven Quadruped Motion Dataset for Generation and Control

📅 2026-03-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited behavioral diversity and shallow integration of language semantics in existing quadrupedal locomotion datasets, which hinder intuitive and agile human–robot interaction. To overcome this, we introduce QuadFM, the first large-scale, high-fidelity quadruped motion dataset comprising 11,784 motion clips accompanied by 35,352 three-tier textual annotations. We further propose Gen2Control RL, a unified reinforcement learning framework that enables end-to-end, text-driven motion generation and control. Our approach uniquely integrates diverse locomotion patterns, expressive behaviors, and natural language instructions, achieving real-time inference under 500 milliseconds on an NVIDIA Orin edge device. Both simulation and real-world experiments demonstrate the generated motions’ diversity, realism, and physical robustness.

📝 Abstract
Despite significant advances in quadrupedal robotics, a critical gap persists in foundational motion resources that holistically integrate diverse locomotion, emotionally expressive behaviors, and rich language semantics, all essential for agile, intuitive human-robot interaction. Current quadruped motion datasets are limited to a few mocap primitives (e.g., walk, trot, sit) and lack diverse behaviors with rich language grounding. To bridge this gap, we introduce Quadruped Foundational Motion (QuadFM), the first large-scale, ultra-high-fidelity dataset designed for text-to-motion generation and general motion control. QuadFM contains 11,784 curated motion clips spanning locomotion, interactive, and emotion-expressive behaviors (e.g., dancing, stretching, peeing), each with a three-layer annotation (fine-grained action labels, interaction scenarios, and natural language commands), totaling 35,352 descriptions to support language-conditioned understanding and command execution. We further propose Gen2Control RL, a unified framework that jointly trains a general motion controller and a text-to-motion generator, enabling efficient end-to-end inference on edge hardware. On a real quadruped robot with an NVIDIA Orin, our system achieves real-time motion synthesis (<500 ms latency). Simulation and real-world results show realistic, diverse motions while maintaining robust physical interaction. The dataset will be released at https://github.com/GaoLii/QuadFM.
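The three-layer annotation scheme described in the abstract (fine-grained action label, interaction scenario, natural language command; 35,352 descriptions over 11,784 clips, i.e., roughly three per clip) could be represented as a simple record type. The sketch below is purely illustrative: all field names, the example values, and the `MotionClip`/`MotionAnnotation` structure are assumptions, not the released dataset schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MotionAnnotation:
    """One hypothetical three-tier annotation for a QuadFM clip.

    Field names are illustrative assumptions; the released dataset
    may use a different schema.
    """
    action_label: str          # tier 1: fine-grained action label
    interaction_scenario: str  # tier 2: interaction scenario
    language_command: str      # tier 3: natural language command


@dataclass
class MotionClip:
    clip_id: str
    annotations: List[MotionAnnotation] = field(default_factory=list)


# Example: one clip carrying one of its (on average ~3) annotations.
clip = MotionClip(
    clip_id="quadfm_000001",  # hypothetical identifier
    annotations=[
        MotionAnnotation(
            action_label="stretch",
            interaction_scenario="morning warm-up routine",
            language_command="Stretch your front legs forward.",
        )
    ],
)
print(len(clip.annotations))  # -> 1
```

Such a record would let a text-to-motion model condition on any of the three tiers, from coarse action labels down to free-form commands.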
Problem

Research questions and friction points this paper is trying to address.

quadruped motion
text-to-motion
motion dataset
human-robot interaction
language grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

QuadFM
text-to-motion generation
quadruped motion dataset
Gen2Control RL
language-conditioned control
Li Gao
AMAP, Alibaba Group
Fuzhi Yang
AMAP, Alibaba Group
Jianhui Chen
AMAP, Alibaba Group
Liu Liu
AMAP, Alibaba Group
Yao Zheng
AMAP, Alibaba Group
Yang Cai
Professor of Computer Science and Economics, Yale University
Ziqiao Li
AMAP, Alibaba Group