LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

📅 2024-06-28
🏛️ arXiv.org
📈 Citations: 13
Influential: 0
🤖 AI Summary
To address the scarcity of demonstration data in robot policy learning, this paper proposes a lightweight vision-language-action (VLA) transfer framework that efficiently adapts pre-trained vision-language models (VLMs) into executable robotic policies. Methodologically, it introduces a conversation-generation paradigm that aligns robotic actions with image pixel coordinates, framing manipulation as visuo-textual interaction. Six auxiliary tasks, constructed in a self-supervised manner without additional action annotations, are combined with behavior cloning data converted into instruction-tuning conversations. Evaluated on both simulated and real-robot platforms, the framework achieves state-of-the-art performance from a limited number of demonstrations and substantially improves few-shot generalization. Crucially, it preserves the VLM's language comprehension and cross-task transfer capabilities while enabling end-to-end visuomotor control.
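The "action → pixel-coordinate" alignment can be pictured with a minimal sketch: one behavior-cloning step (image, instruction, end-effector action) is rewritten as a visuo-textual conversation whose answer expresses the action in normalized image coordinates. The names and prompt template below (`BCStep`, `to_conversation`, the exact question/answer wording) are illustrative assumptions, not the paper's actual data pipeline.

```python
# Hypothetical sketch: turn a behavior-cloning step into a conversation-style
# instruction-tuning sample whose answer is a pixel-coordinate action.
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class BCStep:
    image_path: str                  # camera observation for this step
    instruction: str                 # natural-language task description
    pick_xy: Tuple[float, float]     # action already projected to pixels (u, v)
    place_xy: Tuple[float, float]


def to_conversation(step: BCStep, width: int = 224, height: int = 224) -> List[Dict[str, str]]:
    """Format one demonstration step as a user/assistant conversation.

    Coordinates are normalized to [0, 1] so the textual answer is
    resolution independent; the template itself is an assumption.
    """
    pu, pv = step.pick_xy[0] / width, step.pick_xy[1] / height
    qu, qv = step.place_xy[0] / width, step.place_xy[1] / height
    return [
        {"role": "user",
         "content": f"<image>\nInstruction: {step.instruction}\n"
                    "Where should the robot pick and place, in image coordinates?"},
        {"role": "assistant",
         "content": f"Pick at ({pu:.3f}, {pv:.3f}) and place at ({qu:.3f}, {qv:.3f})."},
    ]


if __name__ == "__main__":
    step = BCStep("obs_000.png", "put the red block on the blue plate",
                  pick_xy=(101.0, 87.0), place_xy=(180.0, 140.0))
    for turn in to_conversation(step):
        print(turn["role"], ":", turn["content"])
```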

📝 Abstract
Vision Language Models (VLMs) have recently been leveraged to generate robotic actions, forming Vision-Language-Action (VLA) models. However, directly adapting a pretrained VLM for robotic control remains challenging, particularly when constrained by a limited number of robot demonstrations. In this work, we introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations and enables an efficient transfer of a pretrained VLM into a powerful VLA, motivated by the success of visual instruction tuning in Computer Vision. First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets, aligning robotic actions with image pixel coordinates. Further, we enhance this dataset in a self-supervised manner by defining six auxiliary tasks, without requiring any additional action annotations. We show that a VLM finetuned with a limited amount of such datasets can produce meaningful action decisions for robotic control. Through experiments across multiple simulated and real-world tasks, we demonstrate that LLaRA achieves state-of-the-art performance while preserving the generalization capabilities of large language models. The code, datasets, and pretrained models are available at https://github.com/LostXine/LLaRA.
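The self-supervised auxiliary tasks mentioned in the abstract can be illustrated in the same spirit: a question/answer pair built purely from scene metadata (object names and 2D positions), with no action annotations required. The paper defines six such tasks; the single localization-style example below, and every name in it, is an assumption used only to show the data format.

```python
# Hypothetical sketch: generate one annotation-free auxiliary sample
# (object localization) from scene metadata alone.
from typing import Dict, List, Tuple


def localization_sample(image_path: str,
                        objects: Dict[str, Tuple[float, float]],
                        target: str,
                        width: int = 224,
                        height: int = 224) -> List[Dict[str, str]]:
    """Build an object-localization conversation; no robot actions are needed."""
    u, v = objects[target]
    return [
        {"role": "user",
         "content": f"<image>\nWhere is the {target} in the image? "
                    "Answer with normalized (x, y) coordinates."},
        {"role": "assistant",
         "content": f"The {target} is at ({u / width:.3f}, {v / height:.3f})."},
    ]


if __name__ == "__main__":
    scene = {"red block": (101.0, 87.0), "blue plate": (180.0, 140.0)}
    for turn in localization_sample("obs_000.png", scene, "red block"):
        print(turn["role"], ":", turn["content"])
```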
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
Robot Learning
Vision-and-Language Navigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLaRA
Vision-Language-Action Models
Continuous Learning