AI Summary
Whole-body control of humanoid robots is held back by a lack of large-scale, diverse, and physically plausible language-motion paired data. To address this gap, this work proposes a simulation-based interactive data generation system that combines composable motion primitives with a low-level whole-body controller. The system offers two editing modes, real-time keyboard input and timeline-based sequence editing, alongside a template-based, multi-style language annotation engine. Implemented in MuJoCo, it records physically grounded whole-body motion trajectories at 50 Hz and pairs them with natural-language descriptions. The project has been open-sourced to support scalable, diverse data generation for language-guided humanoid robot learning.
Abstract
Training language-conditioned whole-body controllers for humanoid robots requires large-scale datasets pairing motion trajectories with natural-language descriptions. Existing approaches based on motion capture are costly and limited in diversity, while text-to-motion generative models produce purely kinematic outputs that are not guaranteed to be physically feasible. We present CLAW, an interactive web-based pipeline for scalable generation of language-annotated whole-body motion data for the Unitree G1 humanoid robot. CLAW treats the motion modes of a kinematic planner as composable building blocks, each parameterized by movement, heading, speed, pelvis height, and duration. It provides two browser-based interfaces, a real-time keyboard mode for exploratory collection and a timeline-based sequence editor for batch collection. A low-level whole-body controller tracks the planner's kinematic references in MuJoCo simulation, producing physically grounded trajectories recorded at 50 Hz. In parallel, a deterministic template-based annotation engine generates diverse natural-language descriptions in multiple stylistic registers for every segment and for the full trajectory. We release the system as open source to support scalable generation of language-motion paired data for humanoid robot learning.
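To make the segment parameterization and the deterministic annotation idea concrete, here is a minimal Python sketch. It is not CLAW's code: the class name `MotionSegment`, the field units, the template strings, the style names, and the hash-based template selection are all assumptions made for illustration. Only the five parameters (movement, heading, speed, pelvis height, duration) and the 50 Hz recording rate come from the abstract.

```python
from dataclasses import dataclass, asdict
import hashlib

RECORD_HZ = 50  # recording rate stated in the abstract


@dataclass(frozen=True)
class MotionSegment:
    """One composable building block (hypothetical field names and units)."""
    movement: str           # e.g. "walk", "crouch-walk"
    heading_deg: float      # assumed: heading in degrees
    speed_mps: float        # assumed: speed in metres per second
    pelvis_height_m: float  # assumed: pelvis height in metres
    duration_s: float       # segment duration in seconds

    def num_frames(self) -> int:
        # Frames a 50 Hz recorder would log for this segment.
        return round(self.duration_s * RECORD_HZ)


# Illustrative templates grouped by stylistic register (not from the paper).
TEMPLATES = {
    "concise": [
        "{movement} for {duration_s:g} s",
        "{movement} at {speed_mps:g} m/s",
    ],
    "descriptive": [
        "the robot performs a {movement} at {speed_mps:g} m/s, heading "
        "{heading_deg:g} degrees, for {duration_s:g} seconds",
        "with its pelvis at {pelvis_height_m:g} m, the robot does a "
        "{movement} for {duration_s:g} seconds",
    ],
}


def annotate(segment: MotionSegment, style: str) -> str:
    """Pick a template deterministically from the segment's own parameters."""
    options = TEMPLATES[style]
    # hashlib is used instead of hash() because str hashing is salted per run.
    digest = hashlib.sha256((repr(segment) + style).encode()).digest()
    template = options[digest[0] % len(options)]
    return template.format(**asdict(segment))


def annotate_trajectory(segments: list[MotionSegment], style: str) -> str:
    """Describe the full trajectory by chaining per-segment descriptions."""
    return ", then ".join(annotate(s, style) for s in segments)


walk = MotionSegment("walk", 0.0, 0.8, 0.75, 2.0)
crouch = MotionSegment("crouch-walk", 90.0, 0.4, 0.55, 3.0)
print(walk.num_frames(), "frames")
print(annotate_trajectory([walk, crouch], "descriptive"))
```

Because the template choice depends only on the segment parameters and the requested style, the same inputs always yield the same description, which is one simple way to get variety across segments while keeping annotations reproducible.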