AI Summary
To address the challenge of autonomous navigation and manipulation for humanoid robots in complex human environments, this paper proposes a hand-eye coordinated hierarchical control framework that decouples visual perception from whole-body motion control. Methodologically, it integrates large-scale human motion-capture data with first-person visual data captured by Meta Aria smart glasses, enabling joint optimization of navigation, motion planning, and dexterous grasping through modular learning. Key contributions include: (1) a cross-modal aligned vision-action representation; and (2) a transferable hierarchical control architecture supporting scene generalization and sample-efficient learning. Evaluated in both simulation and real-world indoor settings, the system achieves an average task success rate exceeding 89% on multi-object delivery tasks, demonstrating significant improvements in environmental adaptability and system scalability.
Abstract
We propose Hand-Eye Autonomous Delivery (HEAD), a framework that learns navigation, locomotion, and reaching skills for humanoids directly from human motion and vision perception data. We take a modular approach in which the high-level planner commands the target positions and orientations of the humanoid's hands and eyes, which are delivered by the low-level policy that controls the whole-body movements. Specifically, the low-level whole-body controller learns to track the three points (eyes, left hand, and right hand) from existing large-scale human motion-capture data, while the high-level policy learns from human data collected by Aria glasses. Our modular approach decouples the ego-centric vision perception from physical actions, promoting efficient learning and scalability to novel scenes. We evaluate our method both in simulation and in the real world, demonstrating the humanoid's capability to navigate and reach in complex environments designed for humans.
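To make the modular interface concrete, the sketch below illustrates the decoupling described in the abstract under stated assumptions: a high-level planner maps an ego-centric image to target poses for the eyes and both hands, and a low-level whole-body controller tracks those three points. This is not the authors' released code; all class and method names are hypothetical.

```python
# Minimal sketch (assumed interface, not the HEAD authors' implementation) of a
# hierarchical hand-eye control loop: high-level perception outputs three-point
# targets; a low-level whole-body policy tracks them.
import numpy as np


class ThreePointTargets:
    """Commanded poses for the eyes, left hand, and right hand."""

    def __init__(self, eyes_pose, left_hand_pose, right_hand_pose):
        # Each pose is assumed to be 7-D: xyz position + unit quaternion.
        self.eyes_pose = eyes_pose
        self.left_hand_pose = left_hand_pose
        self.right_hand_pose = right_hand_pose


class HighLevelPlanner:
    """Hypothetical planner trained on ego-centric human data (e.g. Aria glasses)."""

    def plan(self, ego_image) -> ThreePointTargets:
        # Placeholder: a learned policy would predict targets from the image.
        identity_pose = np.array([0, 0, 0, 1, 0, 0, 0], dtype=float)
        return ThreePointTargets(identity_pose.copy(),
                                 identity_pose.copy(),
                                 identity_pose.copy())


class WholeBodyController:
    """Hypothetical low-level policy trained on motion-capture data."""

    def act(self, robot_state: np.ndarray, targets: ThreePointTargets) -> np.ndarray:
        # Placeholder: return joint-level commands that move the eyes and hands
        # toward the commanded poses.
        return np.zeros_like(robot_state)


def control_step(planner: HighLevelPlanner,
                 controller: WholeBodyController,
                 ego_image,
                 robot_state: np.ndarray) -> np.ndarray:
    """One tick of the decoupled pipeline: perception -> targets -> whole-body action."""
    targets = planner.plan(ego_image)
    return controller.act(robot_state, targets)
```

Because the two modules only share the three-point target interface, the vision side can be retrained on new scenes without touching the whole-body controller, which is the scalability benefit the abstract attributes to the modular design.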