🤖 AI Summary
To address catastrophic forgetting in vision-language models (VLMs) fine-tuned into vision-language-action (VLA) models—where action learning degrades the original multimodal reasoning and instruction-following capabilities due to distribution shift—the paper proposes the VLM2VLA training paradigm. Its core innovation is encoding low-level robot actions as natural language, unifying the action and language distributions at the data level without modifying the backbone or requiring large-scale retraining; fine-tuning is then achieved efficiently with LoRA alone. The method combines natural-language action encoding, lightweight adaptation, and visual question answering (VQA)–based evaluation. Across 800+ real-world robot teleoperation experiments, it demonstrates strong zero-shot generalization to novel tasks while preserving the VLM's original open-world semantic understanding, cross-lingual instruction following, and multimodal reasoning abilities.
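The exact templating VLM2VLA uses to verbalize actions is not given here, but the core idea—rendering a low-level end-effector command as a sentence drawn from the same distribution as ordinary language—can be sketched as follows. The function name, axis conventions, and phrasing are illustrative assumptions, not the paper's actual scheme:

```python
def action_to_language(delta_xyz, delta_rpy, gripper):
    """Hypothetical sketch: render a 7-DoF end-effector action
    (translation in meters, rotation in degrees, gripper state)
    as a natural-language instruction a VLM could emit as text."""
    opposite = {"right": "left", "forward": "backward", "up": "down"}
    parts = []
    for axis, d in zip(["right", "forward", "up"], delta_xyz):
        if abs(d) > 1e-3:  # skip negligible motion
            direction = axis if d > 0 else opposite[axis]
            parts.append(f"move {direction} {abs(d) * 100:.1f} cm")
    _, _, yaw = delta_rpy
    if abs(yaw) > 1e-3:
        spin = "counterclockwise" if yaw > 0 else "clockwise"
        parts.append(f"rotate {spin} {abs(yaw):.1f} degrees")
    parts.append("close the gripper" if gripper < 0.5 else "open the gripper")
    return ", then ".join(parts) + "."

print(action_to_language([0.05, 0.0, -0.02], [0.0, 0.0, 15.0], gripper=0.0))
# → move right 5.0 cm, then move down 2.0 cm, then rotate counterclockwise 15.0 degrees, then close the gripper.
```

Because the target text looks like ordinary instructions rather than raw action tokens, it stays close to the VLM's pretraining distribution, which is the data-level alignment the summary describes.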
📝 Abstract
Fine-tuning vision-language models (VLMs) on robot teleoperation data to create vision-language-action (VLA) models is a promising paradigm for training generalist policies, but it suffers from a fundamental tradeoff: learning to produce actions often diminishes the VLM's foundational reasoning and multimodal understanding, hindering generalization to novel scenarios, instruction following, and semantic understanding. We argue that this catastrophic forgetting is due to a distribution mismatch between the VLM's internet-scale pretraining corpus and the robotics fine-tuning data. Inspired by this observation, we introduce VLM2VLA: a VLA training paradigm that first resolves this mismatch at the data level by representing low-level actions with natural language. This alignment makes it possible to train VLAs solely with Low-Rank Adaptation (LoRA), thereby minimally modifying the VLM backbone and averting catastrophic forgetting. As a result, the VLM can be fine-tuned on robot teleoperation data without fundamentally altering the underlying architecture and without expensive co-training on internet-scale VLM datasets. Through extensive Visual Question Answering (VQA) studies and over 800 real-world robotics experiments, we demonstrate that VLM2VLA preserves the VLM's core capabilities, enabling zero-shot generalization to novel tasks that require open-world semantic reasoning and multilingual instruction following.
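The abstract's claim that LoRA-only training "minimally modif[ies] the VLM backbone" rests on the standard LoRA mechanism: the pretrained weight is frozen and only a low-rank correction is trained. A minimal numpy sketch of that mechanism (dimensions, seed, and initialization follow the common LoRA recipe, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 8, 2, 4    # rank r much smaller than d_in, d_out

W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection, zero-init

def lora_forward(x):
    # Adapted layer: frozen path plus scaled low-rank correction.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B zero-initialized, the adapted layer exactly reproduces the
# frozen layer at the start of fine-tuning, so pretrained behavior
# is the starting point and the backbone itself is never updated.
assert np.allclose(lora_forward(x), W @ x)
```

Only `A` and `B` (2·r·d parameters per layer rather than d²) would receive gradients, which is why the fine-tune is cheap and why reverting to the original VLM is as simple as dropping the adapters.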