🤖 AI Summary
To address the poor generalization of imitation learning across tasks and visual appearances, this paper proposes BC-ViT: the first framework to integrate semantic patch embeddings from DINO-pretrained Vision Transformers into behavioral cloning. It employs unsupervised appearance clustering to automatically generate stable, transferable semantic keypoints, replacing hand-crafted or supervised keypoint detectors. By decoupling visual representation learning from policy learning, BC-ViT significantly improves robustness to unseen object categories, materials, and appearance variations. On a multi-task robotic manipulation dataset, BC-ViT substantially outperforms existing baselines. All code, datasets, and evaluation protocols are publicly released, establishing a new paradigm and benchmark resource for studying generalization in imitation learning.
📝 Abstract
In this paper we leverage self-supervised vision transformers and their emergent semantic abilities to improve the generalization of imitation learning policies. We introduce BC-ViT, an imitation learning algorithm that uses rich patch-level embeddings from a DINO-pretrained Vision Transformer (ViT) to generalize better when learning from demonstrations. Our learner sees the world by clustering appearance features into semantic concepts, forming stable keypoints that generalize across a wide range of appearance variations and object types. We show that this representation enables generalized behaviour by evaluating imitation learning across a diverse dataset of object manipulation tasks. Our method, data, and evaluation approach are made publicly available to facilitate further study of generalization in imitation learning.
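The core idea described above, clustering ViT patch embeddings into semantic groups and reading off one keypoint per group, can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it uses plain k-means from scikit-learn and random stand-in features in place of real DINO patch embeddings, and the hard-assignment keypoint extraction is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_keypoints(patch_feats, n_clusters=8, seed=0):
    """Cluster per-patch embeddings into semantic groups and return one
    keypoint per cluster, taken as the mean location of its patches.

    patch_feats: (H, W, D) grid of patch embeddings.
    Returns: (n_clusters, 2) array of (row, col) keypoints in patch units.
    """
    H, W, D = patch_feats.shape
    feats = patch_feats.reshape(-1, D)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(feats)
    # Grid coordinates of every patch, aligned with the flattened features.
    rows, cols = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    coords = np.stack([rows.ravel(), cols.ravel()], axis=1).astype(float)
    # One keypoint per cluster: centroid of the patches assigned to it.
    return np.array([coords[labels == k].mean(axis=0)
                     for k in range(n_clusters)])

# Stand-in for DINO patch embeddings (e.g. a 14x14 grid of 384-d features).
rng = np.random.default_rng(0)
feats = rng.normal(size=(14, 14, 384))
kps = cluster_keypoints(feats, n_clusters=8)
print(kps.shape)  # (8, 2)
```

In the real pipeline the features would come from a frozen DINO ViT, so the clusters tend to align with object parts; the policy is then conditioned on these keypoint locations rather than on raw pixels.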