🤖 AI Summary
Existing neural controllers for humanoid robots suffer from limited capacity, poor behavioral generalization, and insufficient training scale. Method: This paper proposes a general-purpose, full-body control foundation model grounded in motion tracking. It introduces a unified action token space and a real-time motion planner to support multimodal inputs—including VR, video, and vision-language modalities—and trains a 42-million-parameter network on over 100 million high-quality motion-capture frames, leveraging dense supervision and human motion priors across 9,000 GPU-hours. Contribution/Results: The model significantly improves motion naturalness and cross-task robustness, generalizing effectively to unseen behaviors. Empirical results demonstrate consistent performance gains as model size, dataset volume, and compute budget scale, thereby validating motion tracking as a scalable, high-performing paradigm for humanoid robot control.
📝 Abstract
Despite the rise of billion-parameter foundation models trained across thousands of GPUs, similar scaling gains have not been shown for humanoid control. Current neural controllers for humanoids remain modest in size, target a limited behavior set, and are trained on a handful of GPUs over several days. We show that scaling up model capacity, data, and compute yields a generalist humanoid controller capable of producing natural and robust whole-body movements. Specifically, we posit motion tracking as a natural and scalable task for humanoid control, leveraging dense supervision from diverse motion-capture data to acquire human motion priors without manual reward engineering. We build a foundation model for motion tracking by scaling along three axes: network size (from 1.2M to 42M parameters), dataset volume (over 100M frames, 700 hours of high-quality motion data), and compute (9k GPU hours). Beyond demonstrating the benefits of scale, we show the practical utility of our model through two mechanisms: (1) a real-time universal kinematic planner that bridges motion tracking to downstream task execution, enabling natural and interactive control, and (2) a unified token space that supports various motion input interfaces, such as VR teleoperation devices, human videos, and vision-language-action (VLA) models, all using the same policy. Scaling motion tracking exhibits favorable properties: performance improves steadily with increased compute and data diversity, and learned representations generalize to unseen motions, establishing motion tracking at scale as a practical foundation for humanoid control.