EgoHumanoid: Unlocking In-the-Wild Loco-Manipulation with Robot-Free Egocentric Demonstration

📅 2026-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited generalization of mobile manipulation in humanoid robots operating in real-world complex environments, primarily due to scarce robot-collected data. To overcome this challenge, the authors propose a cross-domain policy learning framework that integrates large-scale first-person human demonstration videos with a small amount of robot data. By establishing a systematic human-to-humanoid alignment mechanism that does not require robot participation—leveraging viewpoint alignment, unified action space mapping, and joint vision-language-action training—the approach effectively bridges morphological and perspective discrepancies. Experimental results demonstrate a 51% performance improvement over baselines trained solely on robot data in real-world settings, with significantly enhanced generalization to unseen scenarios. These findings validate the scalability and efficacy of leveraging abundant human demonstration data for skill transfer to humanoid robots.

📝 Abstract
Human demonstrations offer rich environmental diversity and scale naturally, making them an appealing alternative to robot teleoperation. While this paradigm has advanced robot-arm manipulation, its potential for the more challenging, data-hungry problem of humanoid loco-manipulation remains largely unexplored. We present EgoHumanoid, the first framework to co-train a vision-language-action policy using abundant egocentric human demonstrations together with a limited amount of robot data, enabling humanoids to perform loco-manipulation across diverse real-world environments. To bridge the embodiment gap between humans and robots, including discrepancies in physical morphology and viewpoint, we introduce a systematic alignment pipeline spanning from hardware design to data processing. A portable system for scalable human data collection is developed, and we establish practical collection protocols to improve transferability. At the core of our human-to-humanoid alignment pipeline lie two key components. The view alignment reduces visual domain discrepancies caused by camera height and perspective variation. The action alignment maps human motions into a unified, kinematically feasible action space for humanoid control. Extensive real-world experiments demonstrate that incorporating robot-free egocentric data significantly outperforms robot-only baselines by 51%, particularly in unseen environments. Our analysis further reveals which behaviors transfer effectively and the potential for scaling human data.
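The paper does not publish its training recipe here, but the co-training setup it describes (abundant human egocentric clips mixed with a scarce robot dataset) is commonly implemented by oversampling the small robot set so each batch keeps a fixed domain ratio. The sketch below is a hypothetical illustration of that idea only; the function name, data placeholders, and the 50/50 ratio are assumptions, not taken from the paper.

```python
import random

def make_cotraining_sampler(human_data, robot_data, robot_ratio=0.5, seed=0):
    """Yield (domain, sample) pairs mixing two datasets at a fixed ratio.

    Hypothetical sketch: the scarce robot set is oversampled so that
    roughly `robot_ratio` of draws come from it, regardless of how much
    larger the human egocentric set is.
    """
    rng = random.Random(seed)
    while True:
        if rng.random() < robot_ratio:
            yield ("robot", rng.choice(robot_data))
        else:
            yield ("human", rng.choice(human_data))

# Usage: 10,000 human clips vs. only 100 robot episodes, mixed 50/50.
human = [f"human_clip_{i}" for i in range(10_000)]
robot = [f"robot_ep_{i}" for i in range(100)]
sampler = make_cotraining_sampler(human, robot, robot_ratio=0.5)
batch = [next(sampler) for _ in range(8)]
```

In practice the ratio is a tuning knob: too little robot data and the policy drifts from the humanoid's kinematics; too much and the diversity benefit of the human data is lost.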
Problem

Research questions and friction points this paper is trying to address.

loco-manipulation
egocentric demonstration
humanoid robot
embodiment gap
in-the-wild
Innovation

Methods, ideas, or system contributions that make the work stand out.

egocentric demonstration
humanoid locomotion
view alignment
action alignment
vision-language-action policy