Rethinking VLM Representation for VLA Initialization

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work investigates how pretrained vision-language model (VLM) representations can effectively initialize vision-language-action (VLA) models, framing VLA initialization as a controlled representation-design problem. The analysis runs along three axes: embodied visual question answering (VQA) supervision, parameter-update strategy, and robot-data pretraining. The study finds that the original pretrained VLM representation is a key source of downstream action performance. Gains from embodied VQA adaptation depend on downstream bottlenecks and are not simply additive across capability domains. LoRA provides a more reliable initialization than full-parameter fine-tuning, and robot-data pretraining improves it further, with the strongest variant obtained by staged LoRA-based training.

📝 Abstract

Vision-Language-Action (VLA) models widely adopt pretrained Vision-Language Models (VLMs) as policy backbones, yet it remains unclear what kind of pretrained VLM representation is useful as a VLA initialization. In this paper, we study VLA initialization as a controlled representation-design problem along three axes: capability-level embodied VQA supervision, parameter-update strategy, and robot-data pretraining. Our experiments show that the original pretrained VLM representation is a key source of action performance. However, embodied VQA adaptation does not yield uniform gains: its benefit depends on downstream bottlenecks, and gains from different capability domains are not simply additive. For update strategy, LoRA provides a more reliable initialization than Full Finetune, indicating that overly reshaping the pretrained representation can weaken VLA initialization. Robot-data pretraining further improves VLA initialization, with the strongest variant obtained by staged LoRA-based training. Together, these findings suggest that effective VLM-to-VLA adaptation should inject action-relevant embodied and robot-trajectory signals while preserving the pretrained VLM representation that remains useful for action learning.
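
As a rough illustration of why LoRA reshapes a pretrained representation less than Full Finetune, the sketch below implements a generic LoRA linear layer in plain Python. It is not the paper's code: the class name `LoRALinear`, the rank, the scaling `alpha`, and the toy dimensions are all assumptions chosen for illustration, and the staged training schedule mentioned in the abstract is not modeled here.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen pretrained weight W plus a trainable low-rank update B @ A.

    Hypothetical illustration: names and hyperparameters are assumptions,
    not taken from the paper.
    """

    def __init__(self, weight, rank=2, alpha=4.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight  # pretrained weight, never updated
        # A starts small and random, B starts at zero, so the low-rank
        # update is exactly zero before any training step.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)]
                  for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)                # pretrained path
        update = matvec(self.B, matvec(self.A, x))   # low-rank path
        return [b + self.scale * u for b, u in zip(base, update)]

    def num_trainable(self):
        rank, in_dim, out_dim = len(self.A), len(self.A[0]), len(self.B)
        return rank * in_dim + out_dim * rank


# Toy "pretrained" weight of shape 8 x 16.
rng = random.Random(1)
W = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(8)]
layer = LoRALinear(W, rank=2)
x = [rng.gauss(0.0, 1.0) for _ in range(16)]

full_params = len(W) * len(W[0])       # what Full Finetune would update
lora_params = layer.num_trainable()    # what LoRA updates
```

At initialization the layer reproduces the pretrained output exactly, and only the small matrices `A` and `B` would receive gradient updates. This is one plausible reading of how a LoRA-style update limits how far the pretrained VLM representation can drift, in line with the abstract's observation about overly reshaping it.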

Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action

VLM representation

VLA initialization

embodied VQA

robot-data pretraining

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action

VLM initialization

embodied VQA