🤖 AI Summary
To address the scarcity of demonstration data in robot policy learning, this paper proposes a lightweight vision-language-action (VLA) transfer framework that efficiently adapts pre-trained vision-language models (VLMs) into executable robotic policy models. Methodologically, it introduces a novel dialogue-generation paradigm aligned at the “action → pixel-coordinate” level, framing robotic manipulation as vision–text interaction. Six annotation-free self-supervised auxiliary tasks are designed, jointly integrating behavior cloning, pixel-level action alignment, and multi-task instruction tuning. Evaluated on both simulation and real-robot platforms, the framework achieves state-of-the-art performance with only a few demonstrations, significantly improving few-shot generalization. Crucially, it preserves the VLM’s strong language comprehension and cross-task transfer capabilities while enabling end-to-end visuomotor control.
📝 Abstract
Vision Language Models (VLMs) have recently been leveraged to generate robotic actions, forming Vision-Language-Action (VLA) models. However, directly adapting a pretrained VLM for robotic control remains challenging, particularly when constrained by a limited number of robot demonstrations. In this work, we introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations and enables an efficient transfer of a pretrained VLM into a powerful VLA, motivated by the success of visual instruction tuning in Computer Vision. First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets, aligning robotic actions with image pixel coordinates. Further, we enhance this dataset in a self-supervised manner by defining six auxiliary tasks, without requiring any additional action annotations. We show that a VLM finetuned with a limited amount of such datasets can produce meaningful action decisions for robotic control. Through experiments across multiple simulated and real-world tasks, we demonstrate that LLaRA achieves state-of-the-art performance while preserving the generalization capabilities of large language models. The code, datasets, and pretrained models are available at https://github.com/LostXine/LLaRA.