🤖 AI Summary
VLA models exhibit a sharp decline in generalization under novel camera viewpoints and visual disturbances, primarily due to misaligned spatial modeling rather than physical modeling. This work uncovers an underappreciated robustness reserve in their pre-trained visual representations and proposes a lightweight one-shot calibration framework that enables cross-view adaptation with minimal parameter updates, obviating the need for large-scale fine-tuning. Methodologically, the framework comprises Feature Token Modulation (FTM), a global affine transformation of visual tokens, and Feature Linear Adaptation (FLA), low-rank updates to the ViT encoder. On the Libero benchmark, FTM lifts viewpoint success from 48.5% to 87.1% with only 4K parameters, and FLA reaches 90.8% with 4.7M parameters, matching LoRA-scale fine-tuning at far lower cost and reducing adaptation overhead by orders of magnitude while preserving model efficiency and scalability.
📝 Abstract
Vision-language-action (VLA) models achieve strong in-distribution performance but degrade sharply under novel camera viewpoints and visual perturbations. We show that this brittleness primarily arises from misalignment in Spatial Modeling, rather than Physical Modeling. To address this, we propose a one-shot adaptation framework that recalibrates visual representations through lightweight, learnable updates. Our first method, Feature Token Modulation (FTM), applies a global affine transformation to visual tokens and improves Libero viewpoint accuracy from 48.5% to 87.1% with only 4K parameters. Building on this, Feature Linear Adaptation (FLA) introduces low-rank updates to the ViT encoder, achieving 90.8% success with 4.7M parameters -- matching LoRA-scale finetuning at far lower cost. Together, these results reveal substantial untapped robustness in pretrained VLA models and demonstrate that targeted, minimal visual adaptation is sufficient to restore viewpoint generalization.
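To make the two adapters concrete, here is a minimal NumPy sketch of what a global affine token modulation (FTM) and a low-rank encoder update (FLA) look like in the simplest case. All shapes, parameter names (`gamma`, `beta`, `A`, `B`, rank `r`), and initializations are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16          # token embedding dimension (illustrative)
n_tokens = 8    # number of visual tokens (illustrative)
tokens = rng.standard_normal((n_tokens, d))

# --- Feature Token Modulation (FTM): one global affine transform ---
# A single (gamma, beta) pair shared across all tokens gives only 2*d
# learnable parameters, consistent with the ~4K-parameter scale reported.
gamma = np.ones(d)              # identity scale at init; learned during adaptation
beta = np.zeros(d)              # zero shift at init
ftm_out = tokens * gamma + beta

# --- Feature Linear Adaptation (FLA): low-rank update to an encoder weight ---
# W' = W + B @ A with A: (r, d), B: (d, r); choosing r << d keeps the
# added parameter count small, in the spirit of LoRA-style updates.
W = rng.standard_normal((d, d))   # a frozen pre-trained weight (stand-in)
r = 2
A = rng.standard_normal((r, d)) * 0.01
B = np.zeros((d, r))              # zero-init so the update starts as a no-op
fla_out = tokens @ (W + B @ A).T

# At initialization both adapters leave the frozen model unchanged:
assert np.allclose(ftm_out, tokens)
assert np.allclose(fla_out, tokens @ W.T)
```

In this reading, both adapters are identity at initialization, so adaptation only perturbs the frozen representation as far as the one-shot calibration data demands.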