VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation

📅 2025-10-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the lack of action-execution capability in pretrained vision-language models (VLMs), this paper proposes an action expert distillation framework that avoids costly end-to-end training of vision-language-action (VLA) models. The architecture keeps the original VLM backbone intact, adding only a learnable action token and a lightweight state encoder. Knowledge is then transferred in two stages: first, VLM hidden states are aligned to the action space of a small pretrained action model so that its action decoder can be reused; second, the language model, state encoder, and action modules are selectively fine-tuned for joint multimodal input modeling. This design drastically reduces computational overhead while improving action generation accuracy. Evaluated on LIBERO, LIBERO-LONG, and a real-robot benchmark, the method achieves average success rates of 97.3%, 93.5%, and 82.0%, respectively, with the real-robot result outperforming the teacher model by 17%, demonstrating both efficiency and strong generalization across diverse manipulation tasks.
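In symbols, a plausible form of the stage-1 alignment objective (our notation, not taken from the paper): with $h_a$ the VLM hidden state at the action token, $W$ the learned projection, and $z$ the small action model's latent for the same input,

$$\mathcal{L}_{\text{align}} = \lVert W h_a - z \rVert_2^2,$$

after which the teacher's pretrained action decoder can be applied to $W h_a$ unchanged.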

📝 Abstract
Vision-Language Action (VLA) models significantly advance robotic manipulation by leveraging the strong perception capabilities of pretrained vision-language models (VLMs). By integrating action modules into these pretrained models, VLA methods exhibit improved generalization. However, training them from scratch is costly. In this work, we propose a simple yet effective distillation-based framework that equips VLMs with action-execution capability by transferring knowledge from pretrained small action models. Our architecture retains the original VLM structure, adding only an action token and a state encoder to incorporate physical inputs. To distill action knowledge, we adopt a two-stage training strategy. First, we perform lightweight alignment by mapping VLM hidden states into the action space of the small action model, enabling effective reuse of its pretrained action decoder and avoiding expensive pretraining. Second, we selectively fine-tune the language model, state encoder, and action modules, enabling the system to integrate multimodal inputs with precise action generation. Specifically, the action token provides the VLM with a direct handle for predicting future actions, while the state encoder allows the model to incorporate robot dynamics not captured by vision alone. This design yields substantial efficiency gains over training large VLA models from scratch. Compared with previous state-of-the-art methods, our method achieves a 97.3% average success rate on LIBERO (an 11.8% improvement) and 93.5% on LIBERO-LONG (a 24.5% improvement). In real-world experiments across five manipulation tasks, our method consistently outperforms the teacher model, achieving an 82.0% success rate (a 17% improvement), which demonstrates that action distillation effectively enables VLMs to generate precise actions while substantially reducing training costs.
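A minimal PyTorch sketch of the architecture as described in the abstract; the module names, shapes, and the assumption that the backbone returns a plain hidden-state tensor are ours, not the paper's:

```python
import torch
import torch.nn as nn

class VLAWrapper(nn.Module):
    """Hypothetical student model: a pretrained VLM backbone plus a learnable
    action token, a lightweight state encoder, and a projection into the small
    action model's latent space (whose pretrained decoder is reused as-is)."""

    def __init__(self, vlm: nn.Module, action_decoder: nn.Module,
                 hidden_dim: int, state_dim: int, action_latent_dim: int):
        super().__init__()
        self.vlm = vlm  # original VLM structure, kept intact
        self.action_token = nn.Parameter(torch.zeros(1, 1, hidden_dim))
        self.state_encoder = nn.Sequential(  # robot state -> one extra token
            nn.Linear(state_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # Lightweight mapping into the teacher's action space (trained in stage 1).
        self.proj = nn.Linear(hidden_dim, action_latent_dim)
        self.action_decoder = action_decoder  # pretrained, reused unchanged

    def forward(self, vl_embeds: torch.Tensor, robot_state: torch.Tensor) -> torch.Tensor:
        # vl_embeds: (B, T, hidden_dim) vision-language token embeddings
        # robot_state: (B, state_dim) proprioceptive input
        batch = vl_embeds.size(0)
        state_tok = self.state_encoder(robot_state).unsqueeze(1)  # (B, 1, H)
        act_tok = self.action_token.expand(batch, -1, -1)         # (B, 1, H)
        tokens = torch.cat([vl_embeds, state_tok, act_tok], dim=1)
        hidden = self.vlm(tokens)  # assumed to return hidden states (B, T+2, H)
        h_a = hidden[:, -1]        # hidden state at the action token
        return self.action_decoder(self.proj(h_a))
```

Stage 1 would train only `proj` (and the action token) against the teacher's latents, so the pretrained `action_decoder` can be reused without expensive pretraining.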
Problem

Research questions and friction points this paper is trying to address.

Teaching vision-language models robotic actions via knowledge distillation
Reducing training costs for vision-language action models
Enabling precise action generation while maintaining perception capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distilling action knowledge from small pretrained models
Adding action token and state encoder to VLM structure
Two-stage training with alignment and selective fine-tuning (see the training-schedule sketch after this list)
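Under the same assumptions as the architecture sketch above, the two-stage schedule could be expressed as follows; which submodules count as the "language model" and "action modules" is our reading of the abstract, not the paper's exact grouping:

```python
def set_trainable(modules, flag: bool) -> None:
    # Toggle requires_grad for every parameter in the given modules.
    for m in modules:
        for p in m.parameters():
            p.requires_grad = flag

def configure_stage(model: VLAWrapper, stage: int) -> None:
    set_trainable([model], False)             # start from a fully frozen student
    model.action_token.requires_grad = True   # learnable action token
    set_trainable([model.proj], True)         # stage-1 alignment projection
    if stage == 2:
        # Selective fine-tuning of the language model, state encoder, and
        # action modules. A faithful implementation would keep the vision
        # encoder inside `model.vlm` frozen; we unfreeze the whole backbone
        # here only because the sketch does not separate the two.
        set_trainable([model.vlm, model.state_encoder, model.action_decoder], True)
```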
🔎 Similar Papers
No similar papers found.