🤖 AI Summary
Current vision-language-action (VLA) models generalize poorly to out-of-distribution (OOD) visual scenarios, largely because multi-task fine-tuning induces catastrophic forgetting in the visual backbone (e.g., DINO-v2), degrading its cross-domain representations. To address this, the authors propose a gradual backbone reversal approach founded on model merging, which restores the backbone's visual generalization without retraining it from scratch. The technical pipeline includes backbone decoupling, DINO-v2 feature diagnostics, depth-regression validation, and seamless integration into the OpenVLA architecture. On OOD grasping and lifting tasks, the resulting ReVLA model improves over OpenVLA by 77% and 66%, respectively, demonstrating substantially stronger robustness across diverse scenes and robot modalities.
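The depth-regression diagnostic mentioned above can be illustrated with a linear probe on frozen backbone features: if a simple least-squares regressor can no longer predict depth from per-patch features, the backbone has likely lost geometric information. This is only a sketch of the idea; the helper names and the NumPy least-squares formulation are illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np

def fit_depth_probe(features, depths):
    """Fit a least-squares linear probe mapping feature vectors to depth.

    features: (N, D) array of frozen backbone features (one row per patch).
    depths:   (N,) array of ground-truth depth values.
    Returns the probe weights (D + 1,), including a bias term.
    """
    X = np.hstack([features, np.ones((features.shape[0], 1))])  # append bias column
    w, *_ = np.linalg.lstsq(X, depths, rcond=None)
    return w

def probe_error(features, depths, w):
    """Mean squared error of the probe; a large value suggests the
    backbone's features no longer encode depth (catastrophic forgetting)."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    return float(np.mean((X @ w - depths) ** 2))
```

Comparing this probe error before and after VLA fine-tuning gives a quantitative signal of how much visual geometry the backbone has forgotten.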
📝 Abstract
Recent progress in large language models and access to large-scale robotic datasets have sparked a paradigm shift in robotics, transforming models into generalists able to adapt to various tasks, scenes, and robot modalities. A large step for the community is the release of open Vision-Language-Action (VLA) models, which showcase strong performance in a wide variety of tasks. In this work, we study the visual generalization capabilities of three existing robotic foundation models and propose a corresponding evaluation framework. Our study shows that the existing models do not exhibit robustness to visual out-of-domain scenarios. This is potentially caused by limited variation in the training data and/or catastrophic forgetting, leading to domain limitations in the vision foundation models. We further explore OpenVLA, which uses two pre-trained vision foundation models and is therefore expected to generalize to out-of-domain experiments. However, we demonstrate catastrophic forgetting of DINO-v2 in OpenVLA through its failure at the task of depth regression. To overcome this visual catastrophic forgetting, we propose a gradual backbone reversal approach founded on model merging. This enables OpenVLA, which requires the adaptation of its visual backbones during initial training, to regain its visual generalization ability. Regaining this capability enables our ReVLA model to improve over OpenVLA by 77% and 66% for grasping and lifting in visual OOD tasks.
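The backbone-reversal idea described above rests on model merging, which in its simplest form is a convex combination of the adapted and original backbone weights. The sketch below assumes per-parameter linear interpolation between two state dicts; the function name and the schedule are illustrative, not the paper's exact recipe.

```python
def merge_backbones(adapted_sd, original_sd, alpha):
    """Convex combination of two backbone state dicts.

    alpha = 0.0 keeps the fine-tuned (adapted) weights;
    alpha = 1.0 fully restores the original pre-trained weights.
    Values may be floats or tensors supporting * and +.
    """
    return {
        name: (1.0 - alpha) * w_adapted + alpha * original_sd[name]
        for name, w_adapted in adapted_sd.items()
    }

# A *gradual* reversal would sweep alpha toward 1.0 over training stages
# (e.g. 0.25, 0.5, 0.75, 1.0), continuing to fine-tune the rest of the
# policy at each step so the action head adapts to the restored features.
```

Interpolating weights rather than swapping the backbone outright lets the policy adjust smoothly, which is the motivation for a staged schedule instead of a single hard reset.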