CLIP-RT: Learning Language-Conditioned Robotic Policies from Natural Language Supervision

📅 2024-11-01
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Enabling non-expert users to teach robots manipulation skills via natural language remains challenging because linguistic instructions are hard to ground in real-world visuomotor control. Method: The authors propose a natural language–driven skill learning framework that collects real-robot demonstrations from everyday language commands (e.g., “move the arm to the right”) and learns language-conditioned visuomotor policies from this supervision. At its core is CLIP-RT, a lightweight 1B-parameter vision-language-action model built on the pretrained CLIP encoders, which predicts language-based motion primitives via contrastive imitation learning. The model is pretrained on the Open X-Embodiment dataset and fine-tuned on in-domain data collected by the framework. Contribution/Results: CLIP-RT outperforms the 7B-parameter state-of-the-art model OpenVLA by 24% in average success rate on manipulation benchmarks while using 7x fewer parameters, and shows significant improvements in few-shot generalization. Collaboration with humans or large pretrained models further improves its generalization on challenging tasks.

📝 Abstract
Teaching robots desired skills in real-world environments remains challenging, especially for non-experts. The reliance on specialized expertise in robot control and teleoperation systems often limits accessibility to non-experts. We posit that natural language offers an intuitive and accessible interface for robot learning. To this end, we study two aspects: (1) enabling non-experts to collect robotic data through natural language supervision (e.g., "move the arm to the right") and (2) learning robotic policies directly from this supervision. Specifically, we introduce a data collection framework that collects robot demonstrations based on natural language supervision and further augments these demonstrations. We then present CLIP-RT, a vision-language-action (VLA) model that learns language-conditioned visuomotor policies from this supervision. CLIP-RT adapts the pretrained CLIP models and learns to predict language-based motion primitives via contrastive imitation learning. We train CLIP-RT on the Open X-Embodiment dataset and finetune it on in-domain data collected by our framework to learn diverse skills. CLIP-RT demonstrates strong capabilities in learning novel manipulation skills, outperforming the state-of-the-art model, OpenVLA (7B parameters), by 24% in average success rates, while using 7x fewer parameters (1B). We further observe that CLIP-RT shows significant improvements in few-shot generalization. Finally, through collaboration with humans or large pretrained models, we demonstrate that CLIP-RT can further improve its generalization on challenging tasks.
Problem

Research questions and friction points this paper is trying to address.

Teaching robots skills via natural language.
Reducing reliance on expert robot control and teleoperation.
Grounding language instructions in real-world visuomotor control.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Natural language supervision
Vision-language-action model
Contrastive imitation learning
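The contrastive imitation learning idea can be sketched as follows: the policy embeds the observation (image plus instruction) and a set of candidate language-based motion primitives in a shared space, scores candidates by similarity, and is trained with a cross-entropy (InfoNCE-style) loss against the demonstrated primitive. This is a minimal illustrative sketch, not the paper's implementation: the `embed` function is a random-projection stand-in for the pretrained CLIP image/text towers, and the primitive list and fusion step are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 64  # embedding dimension (stand-in for CLIP's feature size)

def embed(i: int) -> np.ndarray:
    # Hypothetical encoder stub: in CLIP-RT this would be a pretrained
    # CLIP tower; here each item just gets a fixed random unit vector.
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

# Candidate language-based motion primitives (illustrative examples
# in the spirit of the paper's everyday commands).
primitives = [
    "move the arm to the right",
    "move the arm to the left",
    "lower the arm",
    "close the gripper",
]
prim_emb = np.stack([embed(i) for i in range(len(primitives))])  # (K, DIM)

# Observation embedding: fused image + instruction features (stubbed as
# a noisy copy of the demonstrated primitive's embedding).
obs_emb = prim_emb[0] + 0.1 * rng.standard_normal(DIM)
obs_emb /= np.linalg.norm(obs_emb)

# Contrastive scoring: cosine similarity to each candidate, softmax.
logits = prim_emb @ obs_emb
probs = np.exp(logits - logits.max())
probs /= probs.sum()
chosen = primitives[int(np.argmax(probs))]

# Imitation signal: cross-entropy against the demonstrated primitive,
# i.e. an InfoNCE-style loss over the candidate set.
expert_idx = 0
loss = -np.log(probs[expert_idx])
print(chosen, float(loss))
```

At inference the model simply executes the highest-scoring primitive; training pushes the observation embedding toward the expert's primitive and away from the alternatives, which is what makes the objective contrastive rather than a regression onto continuous actions.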