Unveiling the Impact of Data and Model Scaling on High-Level Control for Humanoid Robots

📅 2025-11-12
🤖 AI Summary
To address the scalability bottleneck in high-level humanoid robot control caused by scarce high-quality motion data, this paper introduces Humanoid-Union and SCHUR. Humanoid-Union is the first large-scale, automatically constructed, semantically annotated humanoid robot motion dataset derived from human motion videos, integrating cross-source video parsing, motion retargeting, and text-motion alignment. SCHUR is a Transformer-based, scalable end-to-end learning framework enabling joint scaling of data and model capacity. The work presents the first systematic investigation into how data scale and model capacity jointly affect high-level control performance. Empirical results show a 37% improvement in reconstruction under MPJPE and a 25% improvement in alignment under FID over prior methods. The approach is further validated through deployment on a real humanoid robot.

📝 Abstract
Data scaling has long remained a critical bottleneck in robot learning. For humanoid robots, human videos and motion data are abundant and widely available, offering a free and large-scale data source. In addition, the semantics associated with these motions enable modality alignment and the learning of high-level robot control. However, how to effectively mine raw video, extract robot-learnable representations, and leverage them for scalable learning remains an open problem. To address this, we introduce Humanoid-Union, a large-scale dataset generated through an autonomous pipeline, comprising over 260 hours of diverse, high-quality humanoid robot motion data with semantic annotations derived from human motion videos. The dataset can be further expanded via the same pipeline. Building on this data resource, we propose SCHUR, a scalable learning framework designed to explore the impact of large-scale data on high-level control in humanoid robots. Experimental results demonstrate that SCHUR achieves high robot motion generation quality and strong text-motion alignment under data and model scaling, with a 37% reconstruction improvement under MPJPE and a 25% alignment improvement under FID compared with previous methods. Its effectiveness is further validated through deployment on a real-world humanoid robot.
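
For readers unfamiliar with the reconstruction metric cited above, the following is a minimal sketch of MPJPE (mean per-joint position error) in its commonly used form: the Euclidean distance between each predicted joint and its ground-truth counterpart, averaged over all joints and frames. The abstract does not specify which variant the paper uses (for example root alignment or units), so the function names, the data layout, and the reading of "37% improvement" as a relative reduction in error are all assumptions made for illustration.

```python
import math
from typing import Sequence

# Hypothetical data layout, chosen only for this sketch:
Joint = Sequence[float]   # (x, y, z) position of one joint
Pose = Sequence[Joint]    # all joints in one frame
Motion = Sequence[Pose]   # all frames in one clip


def mpjpe(pred: Motion, target: Motion) -> float:
    """Mean Euclidean joint error, averaged over every joint in every frame."""
    if len(pred) != len(target):
        raise ValueError("clips must have the same number of frames")
    total, count = 0.0, 0
    for pred_pose, target_pose in zip(pred, target):
        if len(pred_pose) != len(target_pose):
            raise ValueError("poses must have the same number of joints")
        for p, t in zip(pred_pose, target_pose):
            total += math.dist(p, t)  # straight-line distance for one joint
            count += 1
    if count == 0:
        raise ValueError("motion contains no joints")
    return total / count


def relative_reduction(baseline: float, ours: float) -> float:
    """Fraction by which an error metric drops relative to a baseline (assumed reading)."""
    return (baseline - ours) / baseline


# One frame, two joints: errors are 0 and 5, so the mean is 2.5.
pred_clip = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
true_clip = [[(0.0, 0.0, 0.0), (1.0, 3.0, 4.0)]]
example_error = mpjpe(pred_clip, true_clip)
```

Under this reading, a baseline error of 100 falling to 63 would correspond to `relative_reduction(100.0, 63.0)`, a drop of 0.37; these numbers are illustrative and not taken from the paper.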
Problem

Research questions and friction points this paper is trying to address.

Mining raw video data for robot-learnable motion representations
Leveraging human motion data for scalable humanoid robot learning
Exploring impact of data scaling on high-level robot control
Innovation

Methods, ideas, or system contributions that make the work stand out.

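Autonomous pipeline generates Humanoid-Union, a large-scale humanoid motion dataset (over 260 hours) with semantic annotations
SCHUR, a scalable learning framework, examines how data and model scaling affect high-level control
Improves reconstruction by 37% under MPJPE and alignment by 25% under FID compared with previous methods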