🤖 AI Summary
This work addresses the problem of enabling robots to acquire skills directly from unlabelled, long-horizon, first-person videos of humans performing everyday tasks, and to execute those skills immediately upon receiving a natural language instruction. Methodologically, the authors propose an end-to-end framework: first, a vision-language model (VLM) retrieves behaviour segments from the raw video that are semantically aligned with the instruction; second, in-context imitation learning (KAT) executes the skill with no fine-tuning or further training. The key contribution is a fully annotation-free, end-to-end skill-acquisition pipeline that eliminates reliance on task-specific training, online adaptation, and behavioural annotations. Evaluated on diverse household tasks, the approach significantly outperforms prior methods, demonstrating strong open-vocabulary instruction generalisation and robust cross-scene transfer, and establishes a scalable "video-to-skill" paradigm for embodied intelligence.
📝 Abstract
We present R+X, a framework which enables robots to learn skills from long, unlabelled, first-person videos of humans performing everyday tasks. Given a language command from a human, R+X first retrieves short video clips containing relevant behaviour, and then executes the skill by conditioning an in-context imitation learning method (KAT) on this behaviour. By leveraging a Vision Language Model (VLM) for retrieval, R+X does not require any manual annotation of the videos, and by leveraging in-context learning for execution, robots can perform commanded skills immediately, without requiring a period of training on the retrieved videos. Experiments studying a range of everyday household tasks show that R+X succeeds at translating unlabelled human videos into robust robot skills, and that R+X outperforms several recent alternative methods. Videos and code are available at https://www.robot-learning.uk/r-plus-x.