🤖 AI Summary
Enabling non-experts to teach robots manipulation skills via natural language remains challenging due to the difficulty of grounding linguistic instructions in real-world visuomotor control.
Method: We propose a natural language–driven robotic skill learning framework that collects real-robot demonstrations directly from everyday language supervision (e.g., "move the arm to the right") and learns language-conditioned visuomotor policies from it. At its core is CLIP-RT, a lightweight 1B-parameter vision-language-action model built on pretrained CLIP encoders, which predicts language-based motion primitives via contrastive imitation learning. The model is pretrained on the Open X-Embodiment dataset and then fine-tuned on in-domain data collected by the framework.
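The contrastive imitation learning objective pairs the current observation and instruction with the language description of the expert's motion primitive. A minimal sketch of this idea, assuming precomputed embedding vectors and a CLIP-style cross-entropy over a batch of candidate primitives; the function name, raw-NumPy setup, and fixed similarity scale are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def contrastive_imitation_loss(context_emb, action_text_embs, expert_idx, scale=100.0):
    """CLIP-style contrastive loss sketch: the joint image+instruction
    embedding should score highest against the text embedding of the
    expert's motion primitive among a batch of candidates.

    context_emb: (d,) embedding of current image + instruction (assumed given)
    action_text_embs: (n, d) embeddings of n candidate motion-primitive texts
    expert_idx: index of the primitive the expert actually demonstrated
    """
    # L2-normalize so dot products are cosine similarities
    c = context_emb / np.linalg.norm(context_emb)
    a = action_text_embs / np.linalg.norm(action_text_embs, axis=1, keepdims=True)
    logits = scale * (a @ c)            # similarity to each candidate primitive
    logits -= logits.max()              # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[expert_idx])   # cross-entropy against the expert action
```

Minimizing this loss pulls the visuomotor context embedding toward the text embedding of the demonstrated primitive and pushes it away from the other candidates, which is what lets the policy reuse CLIP's language space for action prediction.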
Contribution/Results: CLIP-RT outperforms the 7B-parameter state-of-the-art model OpenVLA by 24% in average success rate on real-robot manipulation tasks while using 7x fewer parameters (1B), and shows markedly stronger few-shot generalization. Experiments further show that collaborating with humans or large pretrained models improves its generalization on challenging tasks.
📝 Abstract
Teaching robots desired skills in real-world environments remains challenging, especially for non-experts. The reliance on specialized expertise in robot control and teleoperation systems often limits accessibility to non-experts. We posit that natural language offers an intuitive and accessible interface for robot learning. To this end, we study two aspects: (1) enabling non-experts to collect robotic data through natural language supervision (e.g., "move the arm to the right") and (2) learning robotic policies directly from this supervision. Specifically, we introduce a data collection framework that collects robot demonstrations based on natural language supervision and further augments these demonstrations. We then present CLIP-RT, a vision-language-action (VLA) model that learns language-conditioned visuomotor policies from this supervision. CLIP-RT adapts the pretrained CLIP models and learns to predict language-based motion primitives via contrastive imitation learning. We train CLIP-RT on the Open X-Embodiment dataset and finetune it on in-domain data collected by our framework to learn diverse skills. CLIP-RT demonstrates strong capabilities in learning novel manipulation skills, outperforming the state-of-the-art model, OpenVLA (7B parameters), by 24% in average success rates, while using 7x fewer parameters (1B). We further observe that CLIP-RT shows significant improvements in few-shot generalization. Finally, through collaboration with humans or large pretrained models, we demonstrate that CLIP-RT can further improve its generalization on challenging tasks.
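At inference time, a policy of this kind can act by scoring a closed set of language motion primitives against the current image+instruction embedding and executing the best match. A minimal sketch under that assumption; the primitive list and embeddings below are hypothetical placeholders, not taken from the paper:

```python
import numpy as np

# Hypothetical closed vocabulary of language motion primitives (illustrative only)
PRIMITIVES = [
    "move the arm to the right",
    "move the arm to the left",
    "lower the arm",
    "close the gripper",
]

def select_primitive(context_emb, primitive_embs):
    """Pick the motion primitive whose text embedding has the highest
    cosine similarity with the current image+instruction embedding.

    context_emb: (d,) embedding of current observation + instruction
    primitive_embs: (len(PRIMITIVES), d) text embeddings of the primitives
    """
    c = context_emb / np.linalg.norm(context_emb)
    p = primitive_embs / np.linalg.norm(primitive_embs, axis=1, keepdims=True)
    return PRIMITIVES[int(np.argmax(p @ c))]
```

The selected primitive would then be mapped to a low-level motor command; treating action prediction as retrieval over language keeps the model's output space interpretable to non-experts.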