🤖 AI Summary
Existing neural controllers for humanoid robots suffer from limited capacity, poor behavioral generalization, and insufficient training scale. Method: This paper proposes a general-purpose, full-body control foundation model grounded in motion tracking. It introduces a unified action token space and a real-time motion planner to support multimodal inputs—including VR, video, and vision-language modalities—and trains a 42-million-parameter network on over 100 million high-quality motion-capture frames, leveraging dense supervision and human motion priors across 9,000 GPU-hours. Contribution/Results: The model significantly improves motion naturalness and cross-task robustness, generalizing effectively to unseen behaviors. Empirical results demonstrate consistent performance gains as model size, dataset volume, and compute budget scale, thereby validating motion tracking as a scalable, high-performing paradigm for humanoid robot control.
📝 Abstract
Despite the rise of billion-parameter foundation models trained across thousands of GPUs, similar scaling gains have not been shown for humanoid control. Current neural controllers for humanoids remain modest in size, target a limited behavior set, and are trained on a handful of GPUs over several days. We show that scaling up model capacity, data, and compute yields a generalist humanoid controller capable of producing natural and robust whole-body movements. Specifically, we posit motion tracking as a natural and scalable task for humanoid control, leveraging dense supervision from diverse motion-capture data to acquire human motion priors without manual reward engineering. We build a foundation model for motion tracking by scaling along three axes: network size (from 1.2M to 42M parameters), dataset volume (over 100M frames, 700 hours of high-quality motion data), and compute (9k GPU hours). Beyond demonstrating the benefits of scale, we show the practical utility of our model through two mechanisms: (1) a real-time universal kinematic planner that bridges motion tracking to downstream task execution, enabling natural and interactive control, and (2) a unified token space that supports various motion input interfaces, such as VR teleoperation devices, human videos, and vision-language-action (VLA) models, all using the same policy. Scaling motion tracking exhibits favorable properties: performance improves steadily with increased compute and data diversity, and learned representations generalize to unseen motions, establishing motion tracking at scale as a practical foundation for humanoid control.