🤖 AI Summary
This work addresses the challenge that existing text-driven humanoid motion generation methods, while producing geometrically plausible motions by relying on human motion priors, often violate physical constraints and thus fail to execute stably on real robots. To bridge this gap, we propose PhyGile, a novel framework that, for the first time, introduces physics-aware prefix guidance directly in the robot's native 262-dimensional skeletal space, enabling inference-time generation of physically feasible, agile full-body motions without retargeting artifacts. By integrating curriculum-based mixture-of-experts training with post-training on unlabeled data, PhyGile significantly enhances generalization. Experiments demonstrate that our method enables real humanoid robots to stably track highly dynamic, complex motions specified by natural language instructions, substantially narrowing the gap between motion generation and real-world execution and expanding the frontier of text-driven humanoid control.
📝 Abstract
Humanoid robots are expected to execute agile and expressive whole-body motions in real-world settings. Existing text-to-motion generation models are predominantly trained on captured human motion datasets, whose priors assume human biomechanics, actuation, mass distribution, and contact strategies. When such motions are retargeted directly to humanoid robots, the resulting trajectories may satisfy geometric constraints (e.g., joint limits and pose continuity) and appear kinematically reasonable, yet they frequently violate the physical feasibility required for real-world execution. To address this gap, we present PhyGile, a unified framework that closes the loop between robot-native motion generation and General Motion Tracking (GMT). At inference time, PhyGile generates robot-native motions directly in a 262-dimensional skeletal space under physics-guided prefixes, thereby eliminating inference-time retargeting artifacts and reducing generation-execution discrepancies. Before physics-prefix adaptation, we train the GMT controller with a curriculum-based mixture-of-experts scheme, followed by post-training on unlabeled motion data to improve robustness across large-scale robot motions. During physics-prefix adaptation, the GMT controller is further fine-tuned on generated objectives under physics-derived prefixes, enabling agile and stable execution of complex motions on real robots. Extensive offline and real-robot experiments demonstrate that PhyGile expands the frontier of text-driven humanoid control, enabling stable tracking of agile, highly challenging whole-body motions that go well beyond the walking and low-dynamic behaviors typically achieved by prior methods.
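The abstract does not specify how a physics-derived prefix conditions the generator, so the following is only a minimal illustrative sketch of the general idea: prepend a short, dynamically consistent lead-in (here a simple constant-velocity extrapolation, an assumption on our part) in the robot-native 262-dimensional skeletal space before calling a text-conditioned generator. The names `physics_prefix` and `generate_with_prefix` are hypothetical and not from the paper.

```python
import numpy as np

SKELETAL_DIM = 262  # robot-native skeletal state dimension stated in the paper


def physics_prefix(recent_frames: np.ndarray, horizon: int = 8) -> np.ndarray:
    """Hypothetical physics-derived prefix: extrapolate the last observed
    frames with a constant-velocity model so generation is conditioned on a
    dynamically consistent lead-in rather than starting from a cold state.

    recent_frames: (T, 262) array of recent robot-native skeletal frames.
    Returns: (horizon, 262) prefix frames.
    """
    velocity = recent_frames[-1] - recent_frames[-2]  # per-dimension finite difference
    steps = np.arange(1, horizon + 1)[:, None]        # (horizon, 1) step indices
    return recent_frames[-1] + steps * velocity       # broadcast to (horizon, 262)


def generate_with_prefix(generator, text: str, recent_frames: np.ndarray,
                         num_frames: int = 60) -> np.ndarray:
    """Condition a text-to-motion generator on the physics prefix so its
    output continues smoothly from physically plausible states.

    `generator` is any callable (text, prefix, num_frames) -> (num_frames, 262);
    the real model in the paper is a learned network, not shown here.
    """
    prefix = physics_prefix(recent_frames)
    return generator(text, prefix, num_frames)
```

The point of the sketch is only the interface: generation happens in the robot's own skeletal space, with a physics-informed prefix supplied at inference time instead of retargeting human motion after the fact.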