🤖 AI Summary
To address the poor generalization of imitation learning across tasks and visual appearances, this paper proposes BC-ViT: the first framework to integrate semantic patch embeddings from DINO-pretrained Vision Transformers into behavioral cloning. It employs unsupervised appearance clustering to automatically generate stable, transferable semantic keypoints, replacing hand-crafted or supervised keypoint detectors. By decoupling visual representation learning from policy learning, BC-ViT significantly improves robustness to unseen object categories, materials, and appearance variations. On a multi-task robotic manipulation dataset, BC-ViT substantially outperforms existing baselines. All code, datasets, and evaluation protocols are publicly released, establishing a new paradigm and benchmark resource for studying generalization in imitation learning.
📝 Abstract
In this paper we leverage self-supervised vision transformers and their emergent semantic abilities to improve the generalization of imitation learning policies. We introduce BC-ViT, an imitation learning algorithm that uses rich patch-level embeddings from a DINO-pretrained Vision Transformer (ViT) to generalize better when learning from demonstrations. Our learner sees the world by clustering appearance features into semantic concepts, forming stable keypoints that generalize across a wide range of appearance variations and object types. We show that this representation enables generalized behaviour by evaluating imitation learning across a diverse dataset of object manipulation tasks. Our method, data, and evaluation approach are made publicly available to facilitate further study of generalization in imitation learning.
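The core idea described above, clustering ViT patch embeddings into semantic groups and reading off one keypoint per group, can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it uses plain k-means from scikit-learn and random stand-in features in place of real DINO patch embeddings, and the hard-assignment keypoint extraction is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_keypoints(patch_feats, n_clusters=8, seed=0):
    """Cluster per-patch embeddings into semantic groups and return one
    keypoint per cluster, taken as the mean location of its patches.

    patch_feats: (H, W, D) grid of patch embeddings.
    Returns: (n_clusters, 2) array of (row, col) keypoints in patch units.
    """
    H, W, D = patch_feats.shape
    feats = patch_feats.reshape(-1, D)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(feats)
    # Grid coordinates of every patch, aligned with the flattened features.
    rows, cols = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    coords = np.stack([rows.ravel(), cols.ravel()], axis=1).astype(float)
    # One keypoint per cluster: centroid of the patches assigned to it.
    return np.array([coords[labels == k].mean(axis=0)
                     for k in range(n_clusters)])

# Stand-in for DINO patch embeddings (e.g. a 14x14 grid of 384-d features).
rng = np.random.default_rng(0)
feats = rng.normal(size=(14, 14, 384))
kps = cluster_keypoints(feats, n_clusters=8)
print(kps.shape)  # (8, 2)
```

In the real pipeline the features would come from a frozen DINO ViT, so the clusters tend to align with object parts; the policy is then conditioned on these keypoint locations rather than on raw pixels.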