EgoZero: Robot Learning from Smart Glasses

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Robot learning still lags far behind human dexterity in real-world manipulation and fails to leverage the vast corpus of uncurated, first-person human interaction data captured in the wild. This paper introduces EgoZero, a minimal framework that trains robust, deployable manipulation policies solely from in-the-wild, egocentric human demonstrations recorded with Project Aria smart glasses, using no robot teleoperation data. Methodologically, EgoZero extracts complete, robot-executable actions from the human demonstrations, compresses visual observations into morphology-agnostic state representations, and trains closed-loop policies on top of them. Key contributions include: (1) extraction of robot-executable actions from in-the-wild egocentric human video; (2) a morphology-agnostic state representation that lets closed-loop policies generalize across morphologies, spatial configurations, and semantic tasks; and (3) zero-shot deployment on a Franka Panda gripper across seven diverse manipulation tasks, achieving a 70% average success rate with only 20 minutes of human data collection per task.
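To make the morphology-agnostic state and action idea concrete, here is a minimal sketch in Python. It is not the paper's implementation: the PointState class, the choice of object keypoints plus a single effector point with a closure value, and the delta-position action are illustrative assumptions layered on the summary above.

```python
# Hypothetical sketch of a morphology-agnostic state: the same point-based
# description can be computed from a human hand in egocentric video or from
# a robot gripper, so actions derived from it are robot-executable.
from dataclasses import dataclass
import numpy as np

@dataclass
class PointState:
    object_points: np.ndarray   # (N, 3) 3D keypoints on task-relevant objects
    effector_point: np.ndarray  # (3,) fingertip / gripper-tip position
    closure: float              # 0.0 = open, 1.0 = fully closed

    def flatten(self) -> np.ndarray:
        """Concatenate everything into one vector for a policy network."""
        return np.concatenate(
            [self.object_points.ravel(), self.effector_point, [self.closure]]
        )

def action_between(s_t: PointState, s_next: PointState) -> np.ndarray:
    """Derive an executable action as the change in effector position and
    closure, independent of whether the demonstrator was a hand or a gripper."""
    delta_pos = s_next.effector_point - s_t.effector_point
    delta_closure = np.array([s_next.closure - s_t.closure])
    return np.concatenate([delta_pos, delta_closure])
```

Under this framing, a human demonstration becomes a sequence of (state, action) pairs that a robot policy can be trained on directly, without any robot data.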

📝 Abstract
Despite recent progress in general purpose robotics, robot policies still lag far behind basic human capabilities in the real world. Humans interact constantly with the physical world, yet this rich data resource remains largely untapped in robot learning. We propose EgoZero, a minimal system that learns robust manipulation policies from human demonstrations captured with Project Aria smart glasses, and zero robot data. EgoZero enables: (1) extraction of complete, robot-executable actions from in-the-wild, egocentric, human demonstrations, (2) compression of human visual observations into morphology-agnostic state representations, and (3) closed-loop policy learning that generalizes morphologically, spatially, and semantically. We deploy EgoZero policies on a gripper Franka Panda robot and demonstrate zero-shot transfer with 70% success rate over 7 manipulation tasks and only 20 minutes of data collection per task. Our results suggest that in-the-wild human data can serve as a scalable foundation for real-world robot learning, paving the way toward a future of abundant, diverse, and naturalistic training data for robots. Code and videos are available at https://egozero-robot.github.io.
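The abstract's pairing of compressed, morphology-agnostic states with closed-loop policy learning suggests a behavior-cloning setup over those states. Below is a hypothetical minimal version in PyTorch; the network size, the 4-dimensional delta action, and the training loop are assumptions for illustration, not the paper's reported architecture or hyperparameters.

```python
# Hypothetical behavior-cloning sketch: regress actions from flattened point
# states extracted from human demonstrations, then run the policy closed-loop.
import torch
import torch.nn as nn

class PointPolicy(nn.Module):
    def __init__(self, state_dim: int, action_dim: int = 4, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),  # (dx, dy, dz, d_closure)
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

def train_bc(policy: nn.Module, states: torch.Tensor, actions: torch.Tensor,
             epochs: int = 100, lr: float = 1e-3) -> nn.Module:
    """Plain behavior cloning: minimize MSE between predicted and demonstrated actions."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        loss = nn.functional.mse_loss(policy(states), actions)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy

if __name__ == "__main__":
    # Dummy tensors standing in for (state, action) pairs from smart-glasses demos:
    # e.g. 13 object points (39) + effector point (3) + closure (1) = 43 dims.
    states = torch.randn(512, 43)
    actions = torch.randn(512, 4)
    train_bc(PointPolicy(state_dim=43), states, actions, epochs=10)
```

At deployment, the same point-based state would be recomputed from the robot's observations at every step and the predicted delta applied to the gripper, giving the closed-loop behavior the abstract describes.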
Problem

Research questions and friction points this paper is trying to address.

Learning robot policies from human smart glasses data
Extracting robot-executable actions from human demonstrations
Achieving zero-shot transfer with minimal data collection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learns manipulation policies from Project Aria smart glasses data with zero robot data
Morphology-agnostic state representations
Zero-shot transfer with a 70% average success rate across 7 tasks