AI Summary
This work addresses the challenge of efficiently learning whole-body mobile manipulation policies for robots from human demonstrations alone, despite significant differences in perception and action spaces between human and robot embodiments. The authors propose the HoMMI framework, which collects human demonstrations using a UMI interface augmented with first-person (egocentric) sensing and introduces a cross-embodiment hand-eye policy architecture. This architecture comprises an embodiment-agnostic visual representation, a relaxed head action representation, and a whole-body controller that realizes hand-eye trajectories under robot-specific physical constraints. By eliminating the need for robot involvement during data collection, the approach enables scalable demonstration acquisition and policy transfer, supporting long-horizon mobile manipulation tasks that involve bimanual coordination, navigation, and active perception, and substantially improving the efficiency of transferring complex behaviors from human demonstrations to robotic systems.
Abstract
We present the Whole-Body Mobile Manipulation Interface (HoMMI), a data collection and policy learning framework that learns whole-body mobile manipulation directly from robot-free human demonstrations. We augment UMI interfaces with egocentric sensing to capture the global context required for mobile manipulation, enabling portable, robot-free, and scalable data collection. However, naively incorporating egocentric sensing introduces a larger human-to-robot embodiment gap in both observation and action spaces, making policy transfer difficult. We explicitly bridge this gap with a cross-embodiment hand-eye policy design, including an embodiment-agnostic visual representation, a relaxed head action representation, and a whole-body controller that realizes hand-eye trajectories through coordinated whole-body motion under robot-specific physical constraints. Together, these enable long-horizon mobile manipulation tasks requiring bimanual and whole-body coordination, navigation, and active perception. Results are best viewed at https://hommi-robot.github.io
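The abstract describes the whole-body controller only at a high level. As a rough, hedged illustration of the idea of realizing policy-predicted hand and head targets through coordinated whole-body motion under physical limits, the sketch below shows one generic formulation (damped-least-squares differential kinematics with joint-velocity clipping). The function name, Jacobian shapes, and limit handling are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed, not HoMMI's actual controller): map desired hand
# and head twists from a hand-eye policy to whole-body joint velocities.
import numpy as np


def whole_body_step(J_hand, J_head, v_hand, v_head, qdot_limits, damping=1e-2):
    """Solve for joint velocities that track hand and head twist targets.

    J_hand, J_head : task Jacobians (6 x n) mapping joint velocities to
                     hand / head spatial velocities.
    v_hand, v_head : desired 6D twists predicted by the policy.
    qdot_limits    : per-joint velocity bounds (n,), a crude stand-in for
                     robot-specific physical constraints.
    """
    J = np.vstack([J_hand, J_head])        # stack both tasks (12 x n)
    v = np.concatenate([v_hand, v_head])   # stacked task velocities (12,)
    # Damped least-squares pseudo-inverse keeps the solve well-conditioned
    # near kinematic singularities.
    JJt = J @ J.T + damping * np.eye(J.shape[0])
    qdot = J.T @ np.linalg.solve(JJt, v)
    # Clip to the joint-velocity limits.
    return np.clip(qdot, -qdot_limits, qdot_limits)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 10                                  # e.g. base + arm + neck DoF (assumed)
    J_hand = rng.standard_normal((6, n))    # toy Jacobians for illustration
    J_head = rng.standard_normal((6, n))
    v_hand = np.array([0.05, 0.0, 0.02, 0.0, 0.0, 0.1])  # hand twist target
    v_head = np.array([0.0, 0.0, 0.0, 0.0, 0.2, 0.0])    # head twist target
    qdot = whole_body_step(J_hand, J_head, v_hand, v_head,
                           qdot_limits=np.full(n, 1.0))
    print("joint velocity command:", np.round(qdot, 3))
```

Because both the hand and head tasks share one stacked solve, base, arm, and neck motion are coordinated rather than controlled independently, which is the general spirit of realizing hand-eye trajectories through whole-body motion.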