ZeroWBC: Learning Natural Visuomotor Humanoid Control Directly from Human Egocentric Video

📅 2026-03-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing whole-body control methods for humanoid robots often rely on fixed motion primitives or expensive teleoperation data, making it challenging to generate natural, human-like behaviors such as sitting or kicking. This work proposes the first end-to-end framework that learns visuomotor control policies directly from egocentric human videos, without requiring any real-robot teleoperation data. The approach leverages a vision-language model to predict future full-body motions, which are then retargeted and robustly tracked on a physical robot. Experiments on the Unitree G1 platform demonstrate that the proposed method outperforms baseline approaches in terms of motion naturalness and diversity, establishing a new paradigm for efficient, scalable, and teleoperation-free humanoid control.

📝 Abstract
Achieving versatile and naturalistic whole-body control for humanoid robot scene interaction remains a significant challenge. While recent works have demonstrated autonomous humanoid interactive control, they are constrained to rigid locomotion patterns and lack the versatility to execute more natural, human-like behaviors such as sitting or kicking. Furthermore, acquiring the real-robot teleoperation data these methods rely on is prohibitively expensive and time-consuming. To address these limitations, we introduce ZeroWBC, a novel framework that learns a natural humanoid visuomotor control policy directly from human egocentric videos, eliminating the need for large-scale robot teleoperation data and enabling natural scene-interaction control. Specifically, our approach first fine-tunes a Vision-Language Model (VLM) to predict future whole-body human motions from text instructions and egocentric visual context; these generated motions are then retargeted to the robot's joints and executed via our robust general motion-tracking policy. Extensive experiments on the Unitree G1 humanoid robot demonstrate that our method outperforms baseline approaches in motion naturalness and versatility, establishing a pipeline that eliminates teleoperation data collection overhead and offering a scalable, efficient paradigm for general whole-body humanoid control.
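The abstract describes a three-stage pipeline: a fine-tuned VLM predicts future whole-body human motion from a text instruction and egocentric visual context, the motion is retargeted to the robot's joints, and a general motion-tracking policy executes it. A minimal sketch of that data flow, with all class and function names hypothetical (none come from the paper) and each stage stubbed out:

```python
# Hypothetical sketch of the ZeroWBC pipeline stages described in the
# abstract. All names are illustrative; the stage bodies are stubs.
from dataclasses import dataclass


@dataclass
class Motion:
    """A whole-body motion: one joint-angle vector per frame."""
    frames: list  # each frame is a list of joint angles


def vlm_predict_motion(instruction: str, egocentric_frame) -> Motion:
    """Stage 1: a fine-tuned VLM maps a text instruction plus egocentric
    visual context to a future whole-body human motion (stubbed)."""
    return Motion(frames=[[0.0] * 29 for _ in range(10)])


def retarget(motion: Motion, robot_dof: int = 23) -> Motion:
    """Stage 2: map human joint trajectories onto the robot's joints.
    A real retargeter would solve a per-frame optimization; here we
    simply truncate to the assumed robot DoF count."""
    return Motion(frames=[frame[:robot_dof] for frame in motion.frames])


def track(motion: Motion) -> list:
    """Stage 3: a general motion-tracking policy turns target joint
    trajectories into executed commands (stubbed as a pass-through)."""
    return list(motion.frames)


# End-to-end: instruction + egocentric image -> executed trajectory.
human_motion = vlm_predict_motion("sit on the chair", egocentric_frame=None)
robot_motion = retarget(human_motion)
commands = track(robot_motion)
```

The point of the sketch is the interface between stages, not the internals: each stage consumes and produces joint-space trajectories, which is what lets the VLM be trained on human video while the tracking policy runs on the robot.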
Problem

Research questions and friction points this paper is trying to address.

humanoid control
visuomotor learning
egocentric video
whole-body motion
teleoperation-free
Innovation

Methods, ideas, or system contributions that make the work stand out.

visuomotor control
egocentric video
humanoid robot
vision-language model
motion retargeting
Haoran Yang
Central South University
Graph Neural Networks, Data Mining, Recommendation Systems
Jiacheng Bao
ShanghaiTech University
character animation, robotics, image generation
Yucheng Xin
Shanghai AI Laboratory, Tsinghua University
Haoming Song
Shanghai AI Laboratory, Shanghai Jiao Tong University
Yuyang Tian
University of Science and Technology of China, Shanghai AI Laboratory
Bin Zhao
Northwestern Polytechnical University, Shanghai AI Laboratory
Computer Vision, Embodied Artificial Intelligence
Dong Wang
Shanghai AI Laboratory
Embodied AI, Robot Vision, Robot Foundation Model
Xuelong Li
TeleAI, China Telecom