Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better

📅 2025-05-29

📈 Citations: 0

✨ Influential: 0

career value

191K/year

🤖 AI Summary

To address semantic knowledge degradation, slow training convergence, and reduced generalization caused by jointly training large-scale pretrained vision-language model (VLM) backbones with continuous-action experts in vision-language-action (VLA) models, this paper proposes a knowledge insulation mechanism. Our method employs hierarchical freezing of critical VLM layers and gradient insulation to prevent backward interference from the action expert into the semantic module, augmented by semantic consistency regularization. This work is the first to systematically characterize how action-expert architectures impair semantic transfer. Evaluated on RT-2 and OpenVLA benchmarks, our approach achieves 2.1× faster training convergence, improves zero-shot cross-task generalization accuracy by an average of 14.3%, and significantly enhances VLM semantic representation fidelity over state-of-the-art methods.

Technology Category

Application Category

📝 Abstract

Vision-language-action (VLA) models provide a powerful approach to training control policies for physical systems, such as robots, by combining end-to-end learning with transfer of semantic knowledge from web-scale vision-language model (VLM) training. However, the constraints of real-time control are often at odds with the design of VLMs: the most powerful VLMs have tens or hundreds of billions of parameters, presenting an obstacle to real-time inference, and operate on discrete tokens rather than the continuous-valued outputs that are required for controlling robots. To address this challenge, recent VLA models have used specialized modules for efficient continuous control, such as action experts or continuous output heads, which typically require adding new untrained parameters to the pretrained VLM backbone. While these modules improve real-time and control capabilities, it remains an open question whether they preserve or degrade the semantic knowledge contained in the pretrained VLM, and what effect they have on the VLA training dynamics. In this paper, we study this question in the context of VLAs that include a continuous diffusion or flow matching action expert, showing that naively including such experts significantly harms both training speed and knowledge transfer. We provide an extensive analysis of various design choices, their impact on performance and knowledge transfer, and propose a technique for insulating the VLM backbone during VLA training that mitigates this issue. Videos are available at https://pi.website/research/knowledge_insulation.

Problem

Research questions and friction points this paper is trying to address.

Addressing real-time control limitations in vision-language-action models

Preserving semantic knowledge in pretrained VLMs during VLA training

Improving training speed and knowledge transfer with insulation techniques

Innovation

Methods, ideas, or system contributions that make the work stand out.

Specialized modules for efficient continuous control

Insulating VLM backbone during VLA training

Diffusion or flow matching action expert

🔎 Similar Papers

No similar papers found.