Phantom: Training Robots Without Robots Using Only Human Videos

📅 2025-03-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Robot imitation learning faces high costs and scalability challenges in acquiring real-world hardware interaction data. Method: this paper proposes a policy-training framework driven exclusively by human RGB-D videos, eliminating the need for teleoperation or any robot-specific data during training. It bridges the embodiment gap between human demonstrations and robot execution via a data-editing mechanism that aligns the image distributions of human training data and robot test observations, together with a mapping from human poses to robot actions. Contribution/Results: the resulting end-to-end imitation policies deploy zero-shot on a real robot, and data collection requires only an RGB-D camera, enabling low-cost, user-contributed demonstrations. Experiments demonstrate reliable zero-shot deployment across unseen environments and diverse manipulation tasks, significantly lowering the data-acquisition barrier and cost for general-purpose robot learning.

📝 Abstract
Scaling robotics data collection is critical to advancing general-purpose robots. Current approaches often rely on teleoperated demonstrations which are difficult to scale. We propose a novel data collection method that eliminates the need for robotics hardware by leveraging human video demonstrations. By training imitation learning policies on this human data, our approach enables zero-shot deployment on robots without collecting any robot-specific data. To bridge the embodiment gap between human and robot appearances, we utilize a data editing approach on the input observations that aligns the image distributions between training data on humans and test data on robots. Our method significantly reduces the cost of diverse data collection by allowing anyone with an RGBD camera to contribute. We demonstrate that our approach works in diverse, unseen environments and on varied tasks.
Problem

Research questions and friction points this paper is trying to address.

Teleoperated demonstrations require robot hardware and are difficult to scale.
Human and robot appearances differ (the embodiment gap), so human videos cannot train robot policies directly.
Diverse, real-world data collection is expensive without cheap, widely available sensors such as RGBD cameras.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages human video for robot training
Uses data editing to bridge appearance gap
Enables zero-shot robot deployment
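The data-editing idea can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's implementation: it assumes the editing step removes the human arm from each RGB frame (here with a naive background fill standing in for a real inpainter) and composites a rendered robot arm at the retargeted pose, so that training images resemble what the robot's camera will see at deployment. The function name, arguments, and fill strategy are all illustrative assumptions.

```python
import numpy as np

def edit_observation(rgb, human_mask, robot_render, robot_mask):
    """Hypothetical sketch of the data-editing step: remove the human
    arm from an RGB frame and overlay a rendered robot arm, aligning
    training-image and deployment-image distributions.

    rgb:          (H, W, 3) uint8 camera frame
    human_mask:   (H, W) bool, True where the human arm/hand appears
    robot_render: (H, W, 3) uint8 rendering of the robot at the
                  retargeted pose (assumed to come from a renderer)
    robot_mask:   (H, W) bool, True where the rendered robot is visible
    """
    edited = rgb.copy()
    # Naive stand-in for inpainting: fill the human region with the
    # per-image median color. A real system would use a learned
    # inpainter or a clean background frame instead.
    background = np.median(rgb.reshape(-1, 3), axis=0).astype(rgb.dtype)
    edited[human_mask] = background
    # Composite the rendered robot on top of the edited frame.
    edited[robot_mask] = robot_render[robot_mask]
    return edited
```

Edited frames like these, paired with robot actions mapped from the human poses, would then be used as ordinary imitation-learning training data.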