VLA Models Are More Generalizable Than You Think: Revisiting Physical and Spatial Modeling

📅 2025-12-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
VLA models exhibit a sharp decline in generalization under novel camera viewpoints and visual perturbations, a brittleness traced primarily to misaligned spatial rather than physical modeling. This work uncovers underappreciated robustness latent in their pretrained visual representations and proposes the lightweight Visual Representation Calibration (VRC) framework, which enables cross-view adaptation with minimal parameter updates and obviates large-scale fine-tuning. Methodologically, VRC combines Feature Token Modulation (FTM), a global affine transformation of the visual tokens, with Feature Linear Adaptation (FLA), low-rank updates to the ViT encoder. On the Libero benchmark, FTM alone lifts viewpoint accuracy from 48.5% to 87.1% with only 4K parameters, and FLA reaches 90.8% with 4.7M parameters, matching LoRA-scale fine-tuning while reducing adaptation overhead by orders of magnitude.

📝 Abstract
Vision-language-action (VLA) models achieve strong in-distribution performance but degrade sharply under novel camera viewpoints and visual perturbations. We show that this brittleness primarily arises from misalignment in Spatial Modeling, rather than Physical Modeling. To address this, we propose a one-shot adaptation framework that recalibrates visual representations through lightweight, learnable updates. Our first method, Feature Token Modulation (FTM), applies a global affine transformation to visual tokens and improves Libero viewpoint accuracy from 48.5% to 87.1% with only 4K parameters. Building on this, Feature Linear Adaptation (FLA) introduces low-rank updates to the ViT encoder, achieving 90.8% success with 4.7M parameters -- matching LoRA-scale finetuning at far lower cost. Together, these results reveal substantial untapped robustness in pretrained VLA models and demonstrate that targeted, minimal visual adaptation is sufficient to restore viewpoint generalization.
Problem

Research questions and friction points this paper is trying to address.

Addresses VLA model brittleness under novel camera viewpoints
Identifies spatial modeling misalignment as primary robustness issue
Proposes lightweight adaptation to restore viewpoint generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

One-shot adaptation framework recalibrates visual representations
Feature Token Modulation applies global affine transformation to tokens
Feature Linear Adaptation introduces low-rank updates to encoder