🤖 AI Summary
This paper addresses the challenge of expensive teleoperation data dependency in mobile robot imitation learning. We propose an end-to-end training paradigm that requires no teleoperated demonstrations from mobile robots. Methodologically, we introduce the first framework that jointly leverages human first-person vision–pose data and static-robot offline datasets, enabling cross-modal alignment and co-training to transfer full-body human motion into mobile robot policies. Our key contributions are: (1) eliminating reliance on mobile teleoperation data; (2) enabling generalization across diverse spatial layouts and unseen environments; and (3) achieving consistent performance gains with increasing scale of human demonstration data. Evaluated on three real-world navigation and manipulation tasks, our approach matches or surpasses the success rates of Mobile ALOHA—a teleoperation-based baseline—demonstrating strong efficacy, cross-environment generalizability, and scalability.
📝 Abstract
Scaling mobile manipulation imitation learning is bottlenecked by expensive mobile robot teleoperation. We present Egocentric Mobile MAnipulation (EMMA), an end-to-end framework training mobile manipulation policies from human mobile manipulation data with static robot data, sidestepping mobile teleoperation. To accomplish this, we co-train human full-body motion data with static robot data. In our experiments across three real-world tasks, EMMA demonstrates comparable performance to baselines trained on teleoperated mobile robot data (Mobile ALOHA), achieving higher or equivalent task performance in full task success. We find that EMMA is able to generalize to new spatial configurations and scenes, and we observe positive performance scaling as we increase the hours of human data, opening new avenues for scalable robotic learning in real-world environments. Details of this project can be found at https://ego-moma.github.io/.