🤖 AI Summary
Existing whole-body control methods for humanoid robots often rely on fixed motion primitives or expensive teleoperation data, making it challenging to generate natural, human-like behaviors such as sitting or kicking. This work proposes the first end-to-end framework that learns visuomotor control policies directly from egocentric human videos, without requiring any real-robot teleoperation data. The approach leverages a vision-language model to predict future full-body motions, which are then retargeted and robustly tracked on a physical robot. Experiments on the Unitree G1 platform demonstrate that the proposed method outperforms baseline approaches in terms of motion naturalness and diversity, establishing a new paradigm for efficient, scalable, and teleoperation-free humanoid control.
📝 Abstract
Achieving versatile and naturalistic whole-body control for humanoid robot scene interaction remains a significant challenge. While some recent works have demonstrated autonomous humanoid interactive control, they are constrained to rigid locomotion patterns and lack the versatility to execute more natural, human-like behaviors such as sitting or kicking. Moreover, the real-robot teleoperation data these methods depend on is prohibitively expensive and time-consuming to collect. To address these limitations, we introduce ZeroWBC, a novel framework that learns a natural humanoid visuomotor control policy directly from egocentric human videos, eliminating the need for large-scale robot teleoperation data while enabling natural humanoid scene-interaction control. Specifically, our approach first fine-tunes a Vision-Language Model (VLM) to predict future whole-body human motions from text instructions and egocentric visual context; the generated motions are then retargeted to the robot's joints and executed by our robust general motion tracking policy. Extensive experiments on the Unitree G1 humanoid robot demonstrate that our method outperforms baseline approaches in motion naturalness and versatility, establishing a pipeline free of teleoperation data collection overhead and offering a scalable, efficient paradigm for general humanoid whole-body control.
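The three-stage pipeline described in the abstract (VLM motion prediction → retargeting → motion tracking) can be sketched as below. This is a minimal illustration of the data flow only: every function, class, and dimension here is a hypothetical placeholder, not the authors' actual API or model.

```python
# Hypothetical sketch of the ZeroWBC pipeline's data flow.
# All names (predict_motion_vlm, retarget_to_robot, track_on_robot) and the
# joint count are illustrative assumptions, not the paper's implementation.
from dataclasses import dataclass
from typing import List


@dataclass
class MotionSequence:
    """A sequence of whole-body poses, one joint-angle vector per frame."""
    frames: List[List[float]]


def predict_motion_vlm(instruction: str, egocentric_frame: bytes) -> MotionSequence:
    """Stage 1: a fine-tuned VLM maps a text instruction plus egocentric
    visual context to a future whole-body human motion (stubbed here with
    a 10-frame, 29-DoF zero trajectory)."""
    return MotionSequence(frames=[[0.0] * 29 for _ in range(10)])


def retarget_to_robot(human_motion: MotionSequence) -> MotionSequence:
    """Stage 2: retarget human joint trajectories onto the robot's joint
    configuration (identity mapping in this stub)."""
    return MotionSequence(frames=[list(f) for f in human_motion.frames])


def track_on_robot(robot_motion: MotionSequence) -> int:
    """Stage 3: a general motion-tracking policy executes the retargeted
    trajectory on hardware; here we simply count executed frames."""
    return len(robot_motion.frames)


def zero_wbc_pipeline(instruction: str, egocentric_frame: bytes) -> int:
    """End-to-end flow: text + egocentric view -> human motion ->
    robot motion -> tracked execution."""
    human_motion = predict_motion_vlm(instruction, egocentric_frame)
    robot_motion = retarget_to_robot(human_motion)
    return track_on_robot(robot_motion)
```

The key design point the abstract emphasizes is that stage 1 is trained on egocentric human video rather than robot teleoperation logs; stages 2 and 3 bridge the embodiment gap between human and robot kinematics.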