🤖 AI Summary
This work addresses the problem of enabling robots to acquire skills directly from unlabelled, long-horizon, first-person videos of humans performing everyday tasks, and to execute those skills immediately upon receiving a natural language instruction. Methodologically, the authors propose an end-to-end framework: first, a vision-language model (VLM) retrieves behaviour segments from the raw video that are semantically aligned with the instruction; second, in-context imitation learning (KAT) executes the skill with no fine-tuning or further training. The key contribution is a fully annotation-free, end-to-end skill-acquisition pipeline that eliminates reliance on task-specific training, online adaptation, and behavioural annotations. Evaluated on diverse household tasks, the approach significantly outperforms prior methods, demonstrating strong open-vocabulary instruction generalisation and robust cross-scene transfer, and establishes a scalable "video-to-skill" paradigm for embodied intelligence.
📝 Abstract
We present R+X, a framework which enables robots to learn skills from long, unlabelled, first-person videos of humans performing everyday tasks. Given a language command from a human, R+X first retrieves short video clips containing relevant behaviour, and then executes the skill by conditioning an in-context imitation learning method (KAT) on this behaviour. By leveraging a Vision Language Model (VLM) for retrieval, R+X does not require any manual annotation of the videos, and by leveraging in-context learning for execution, robots can perform commanded skills immediately, without requiring a period of training on the retrieved videos. Experiments studying a range of everyday household tasks show that R+X succeeds at translating unlabelled human videos into robust robot skills, and that R+X outperforms several recent alternative methods. Videos and code are available at https://www.robot-learning.uk/r-plus-x.