🤖 AI Summary
This work addresses the limited generalization of RGB-based imitation learning to unseen objects and scenes and its reliance on large demonstration datasets. The authors study Keypoint-based Imitation Learning (KIL), which uses vision foundation models to extract keypoints in one shot as an intermediate representation. They combine approaches from prior KIL work, systematically evaluate several design choices, and extend KIL to tasks with multiple object instances. Evaluated on five real-world tasks, KIL achieves an overall success rate of 75%, significantly outperforming the RGB baseline (47%) and performing on par with S2-diffusion (73%). The results support KIL as a data-efficient approach that generalizes to unseen objects and scene variations, although it does not outperform alternative representations. The work also identifies limitations of current vision foundation models in keypoint extraction and provides practical design guidelines for implementing KIL.
📝 Abstract
RGB-based imitation learning requires many demonstrations to generalize to unseen objects or scenes, motivating research into intermediate representations that improve generalization for robotic manipulation. Visual foundation models enable one-shot extraction of keypoints, which can serve as such a representation. However, it remains unclear how best to integrate keypoints into imitation learning and when they outperform alternative representations. We combine approaches from previous works on keypoint imitation learning (KIL) and investigate several design choices to provide practical guidelines. Using over 2000 real-world rollouts, we also assess how well KIL generalizes to unseen objects and scene variations. KIL achieves a 75% overall success rate across five tasks, significantly outperforming the RGB baseline (47%) and performing on par with S2-diffusion (73%). Finally, we explore the limitations of the foundation models used for keypoint extraction and extend KIL to tasks with multiple object instances. Our results confirm KIL as a data-efficient approach for robot learning, though it does not outperform alternative representations and inherits the limitations of the foundation models used for keypoint extraction. All rollout videos, demonstrations, and results are available at https://kil-manipulation.github.io/.