Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization

📅 2025-10-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the degradation of vision-language (VL) representations during action fine-tuning of Vision-Language-Action (VLA) models, which impairs out-of-distribution (OOD) generalization. We systematically characterize the trade-off between action adaptation and visual representation collapse. To mitigate this, we propose a lightweight hidden-layer representation alignment strategy that explicitly preserves pre-trained VL knowledge via cross-task feature constraints and attention-guided regularization, without incurring additional inference overhead. Through representation probing, attention-map visualization, and ablations on contrastive VL tasks, we show that our method substantially alleviates visual representation degradation and improves OOD generalization across multiple robotic manipulation benchmarks. The implementation is publicly available.
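To make the summary concrete, here is a minimal sketch of what an attention-guided hidden-state alignment loss could look like. This is an illustrative assumption, not the paper's actual formulation: the function names, the L2 distance between the fine-tuned VLA's and the frozen VLM's hidden states, and the use of the frozen model's attention as per-token weights are all hypothetical choices for exposition.

```python
import numpy as np

def alignment_loss(h_vla, h_vlm, attn, lam=1.0):
    """Hypothetical alignment regularizer (illustrative, not the paper's method).

    h_vla: hidden states of the action-fine-tuned model, shape (tokens, dim)
    h_vlm: hidden states of the frozen pre-trained VLM, same shape
    attn:  frozen model's attention mass over visual tokens, shape (tokens,)
    lam:   regularization weight added to the action-learning objective
    """
    w = attn / attn.sum()                        # normalize to a token distribution
    diff = ((h_vla - h_vlm) ** 2).sum(axis=-1)   # per-token squared L2 distance
    return lam * float((w * diff).sum())         # attention-weighted drift penalty
```

During fine-tuning, such a term would be added to the action loss so that visual tokens the pre-trained VLM attended to are penalized most for drifting, which costs nothing at inference time since the frozen branch is only used during training.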

📝 Abstract
The growing success of Vision-Language-Action (VLA) models stems from the promise that pretrained Vision-Language Models (VLMs) can endow agents with transferable world knowledge and vision-language (VL) grounding, laying a foundation for action models with broader generalization. Yet when these VLMs are adapted to the action modality, it remains unclear to what extent their original VL representations and knowledge are preserved. In this work, we conduct a systematic study of representation retention during VLA fine-tuning, showing that naive action fine-tuning leads to degradation of visual representations. To characterize and measure these effects, we probe the VLA's hidden representations and analyze attention maps; further, we design a set of targeted tasks and methods that contrast VLA models with their counterpart VLMs, isolating the changes in VL capabilities induced by action fine-tuning. We further evaluate a range of strategies for aligning visual representations and introduce a simple yet effective method that mitigates degradation and yields improved generalization to out-of-distribution (OOD) scenarios. Taken together, our analysis clarifies the trade-off between action fine-tuning and the degradation of VL representations and highlights practical approaches to recover inherited VL capabilities. Code is publicly available: https://blind-vla-paper.github.io
Problem

Research questions and friction points this paper is trying to address.

Studying visual representation degradation during VLA fine-tuning
Measuring VL capability loss caused by action adaptation
Developing methods to improve OOD generalization in VLAs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Aligning visual representations to prevent degradation
Probing hidden representations and analyzing attention maps
Mitigating degradation for improved OOD generalization
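The probing idea in the bullets above is commonly realized as a linear probe: fit a simple linear model on frozen hidden states and check how well task labels can be decoded before versus after action fine-tuning. The sketch below is a generic ridge-regression probe under that assumption; the function names and the regularization choice are illustrative, not taken from the paper.

```python
import numpy as np

def fit_linear_probe(H, y, l2=1e-3):
    """Fit a ridge-regression probe on hidden states H (n, dim) with binary labels y (n, 1)."""
    H1 = np.hstack([H, np.ones((H.shape[0], 1))])  # append a bias column
    # Closed-form ridge solution: (H1^T H1 + l2 I)^{-1} H1^T y
    return np.linalg.solve(H1.T @ H1 + l2 * np.eye(H1.shape[1]), H1.T @ y)

def probe_accuracy(H, y, W):
    """Decoding accuracy of the probe: higher means the property is still linearly readable."""
    H1 = np.hstack([H, np.ones((H.shape[0], 1))])
    preds = (H1 @ W > 0.5).astype(int)
    return float((preds.ravel() == y.ravel()).mean())
```

Comparing probe accuracy on VLM hidden states against the same layer of the fine-tuned VLA gives a simple scalar measure of how much linearly decodable VL information action adaptation has destroyed.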