🤖 AI Summary
Current vision-language-action (VLA) models generalize poorly to out-of-distribution (OOD) visual scenarios, largely because multi-task fine-tuning induces catastrophic forgetting in the visual backbone (e.g., DINO-v2), degrading its cross-domain representations. To address this, the authors propose a gradual backbone reversal approach founded on model merging, which restores the backbone's visual generalization without retraining it from scratch. The technical pipeline includes backbone decoupling, DINO-v2 feature diagnostics, depth-regression validation, and seamless integration into the OpenVLA architecture. On OOD grasping and lifting tasks, the resulting ReVLA model improves over OpenVLA by 77% and 66%, respectively, demonstrating substantially stronger robustness across diverse scenes and robot modalities.
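The depth-regression diagnostic mentioned above can be illustrated with a linear probe on frozen backbone features: if a simple least-squares regressor can no longer predict depth from per-patch features, the backbone has likely lost geometric information. This is only a sketch of the idea; the helper names and the NumPy least-squares formulation are illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np

def fit_depth_probe(features, depths):
    """Fit a least-squares linear probe mapping feature vectors to depth.

    features: (N, D) array of frozen backbone features (one row per patch).
    depths:   (N,) array of ground-truth depth values.
    Returns the probe weights (D + 1,), including a bias term.
    """
    X = np.hstack([features, np.ones((features.shape[0], 1))])  # append bias column
    w, *_ = np.linalg.lstsq(X, depths, rcond=None)
    return w

def probe_error(features, depths, w):
    """Mean squared error of the probe; a large value suggests the
    backbone's features no longer encode depth (catastrophic forgetting)."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    return float(np.mean((X @ w - depths) ** 2))
```

Comparing this probe error before and after VLA fine-tuning gives a quantitative signal of how much visual geometry the backbone has forgotten.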
📝 Abstract
Recent progress in large language models and access to large-scale robotic datasets have sparked a paradigm shift in robotics, transforming models into generalists able to adapt to various tasks, scenes, and robot modalities. A large step for the community is the release of open Vision-Language-Action (VLA) models, which showcase strong performance in a wide variety of tasks. In this work, we study the visual generalization capabilities of three existing robotic foundation models and propose a corresponding evaluation framework. Our study shows that the existing models do not exhibit robustness to visual out-of-domain scenarios. This is potentially caused by limited variation in the training data and/or catastrophic forgetting, leading to domain limitations in the vision foundation models. We further explore OpenVLA, which uses two pre-trained vision foundation models and is therefore expected to generalize to out-of-domain experiments. However, we demonstrate catastrophic forgetting of DINO-v2 in OpenVLA through its failure at the task of depth regression. To overcome this visual catastrophic forgetting, we propose a gradual backbone reversal approach founded on model merging. This enables OpenVLA, which requires the adaptation of its visual backbones during initial training, to regain its visual generalization ability. Regaining this capability enables our ReVLA model to improve over OpenVLA by 77% and 66% for grasping and lifting in visual OOD tasks.
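The backbone-reversal idea described above rests on model merging, which in its simplest form is a convex combination of the adapted and original backbone weights. The sketch below assumes per-parameter linear interpolation between two state dicts; the function name and the schedule are illustrative, not the paper's exact recipe.

```python
def merge_backbones(adapted_sd, original_sd, alpha):
    """Convex combination of two backbone state dicts.

    alpha = 0.0 keeps the fine-tuned (adapted) weights;
    alpha = 1.0 fully restores the original pre-trained weights.
    Values may be floats or tensors supporting * and +.
    """
    return {
        name: (1.0 - alpha) * w_adapted + alpha * original_sd[name]
        for name, w_adapted in adapted_sd.items()
    }

# A *gradual* reversal would sweep alpha toward 1.0 over training stages
# (e.g. 0.25, 0.5, 0.75, 1.0), continuing to fine-tune the rest of the
# policy at each step so the action head adapts to the restored features.
```

Interpolating weights rather than swapping the backbone outright lets the policy adjust smoothly, which is the motivation for a staged schedule instead of a single hard reset.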