🤖 AI Summary
This work addresses the challenge of enabling robots to achieve zero-shot operational generalization in unknown, complex environments from minimal human demonstration videos. To this end, we propose Retrieving-from-Video (RfV), a novel framework that constructs a large-scale first-person daily manipulation video bank. RfV extracts mid-level semantic representations by jointly modeling object affordance masks and hand motion trajectories; a video retrieval module identifies semantically similar demonstrations, and a policy generation network decodes them into executable actions. By integrating multimodal cues—including visual observations, action trajectories, and task descriptions—RfV enables cross-task knowledge transfer. Evaluated on both simulated and real-world robotic platforms, RfV significantly outperforms state-of-the-art imitation learning and vision-language model baselines, achieving high success rates on unseen tasks. These results demonstrate RfV’s strong generalization capability and practical deployability for real-world robotic manipulation.
📝 Abstract
Robots operating in complex and uncertain environments face considerable challenges. Advanced robotic systems often rely on extensive datasets to learn manipulation tasks. In contrast, when humans are faced with an unfamiliar task, such as assembling a chair, a common approach is to learn by watching video demonstrations. In this paper, we propose a novel method for learning robot policies by Retrieving-from-Video (RfV), using analogies from human demonstrations to address manipulation tasks. Our system constructs a video bank comprising recordings of humans performing diverse daily tasks. To enrich the knowledge from these videos, we extract mid-level information, such as object affordance masks and hand motion trajectories, which serves as additional input to enhance the robot model's learning and generalization capabilities. Our framework features a dual-component design: a video retriever that taps into the external video bank to fetch task-relevant videos based on the task specification, and a policy generator that integrates this retrieved knowledge into the learning cycle. This approach enables robots to craft adaptive responses to various scenarios and to generalize to tasks beyond those in the training data. Through rigorous testing in multiple simulated and real-world settings, our system demonstrates a marked improvement in performance over conventional robotic systems, showcasing a significant advance in the field of robotics.
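To make the retrieval step concrete, here is a minimal, hypothetical sketch of the retriever component: it ranks entries in a small video bank by cosine similarity between task-description embeddings and returns the closest demonstration along with its mid-level cues (affordance mask id, hand trajectory). All names, embeddings, and fields below are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of an RfV-style video retriever (illustrative only).
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_emb, video_bank, k=1):
    """Return the k bank entries whose task embedding best matches the query."""
    ranked = sorted(video_bank, key=lambda v: cosine(query_emb, v["emb"]),
                    reverse=True)
    return ranked[:k]

# Toy bank: each entry pairs a task embedding with mid-level cues
# (affordance mask id, hand trajectory), as the abstract describes.
bank = [
    {"task": "open drawer", "emb": [1.0, 0.1, 0.0],
     "affordance": "handle_mask", "trajectory": "pull_back"},
    {"task": "pour water", "emb": [0.0, 1.0, 0.2],
     "affordance": "rim_mask", "trajectory": "tilt_wrist"},
]

# A query embedding close to "open drawer" retrieves that demonstration;
# its cues would then condition the policy generator.
best = retrieve([0.9, 0.2, 0.0], bank, k=1)[0]
print(best["task"])  # prints "open drawer"
```

In the full system, the retrieved demonstration's affordance mask and hand trajectory would be fed to the policy generator alongside the robot's current observation; this toy version only shows the similarity-based lookup.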