🤖 AI Summary
This work investigates how vision-language-action (VLA) models can effectively inherit prior knowledge from large vision-language models (VLMs) to improve generalization in embodied control. Method: We introduce GrinningFace, a diagnostic benchmark that uses emojis (ubiquitous in Internet-scale VLM pre-training data yet largely absent from robotics datasets) as clean proxies, building emoji tabletop manipulation tasks in both simulated and real-robot environments to enable systematic evaluation of knowledge transfer. We comparatively analyze strategies including parameter-efficient fine-tuning, VLM freezing, co-training, and discrete and latent action prediction. Results: Experiments demonstrate that preserving VLM priors substantially improves VLA cross-task generalization, confirming their critical role in embodied intelligence. GrinningFace provides a reproducible, interpretable diagnostic framework for studying VLA knowledge transfer, offering both theoretical insight and practical guidance for building generalizable embodied AI systems.
📝 Abstract
Vision-language-action (VLA) models hold the promise of attaining generalizable embodied control. To achieve this, a pervasive paradigm is to leverage the rich vision-semantic priors of large vision-language models (VLMs). However, a fundamental question persists: how do VLAs effectively inherit the prior knowledge from VLMs? To address this critical question, we introduce a diagnostic benchmark, GrinningFace, an emoji tabletop manipulation task in which a robot arm is asked to place objects onto printed emojis corresponding to language instructions. This task design is particularly revealing -- knowledge associated with emojis is ubiquitous in the Internet-scale datasets used for VLM pre-training, yet emojis themselves are largely absent from standard robotics datasets. Consequently, they provide a clean proxy: successful task completion indicates effective transfer of VLM priors to embodied control. We implement this diagnostic task both in a simulated environment and on a real robot, and compare various promising techniques for knowledge transfer. Specifically, we investigate the effects of parameter-efficient fine-tuning, VLM freezing, co-training, predicting discretized actions, and predicting latent actions. Through systematic evaluation, our work not only demonstrates the critical importance of preserving VLM priors for the generalization of VLAs but also establishes guidelines for future research in developing truly generalizable embodied AI systems.
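One of the compared techniques, predicting discretized actions, maps each continuous action dimension onto a fixed vocabulary of bins so an autoregressive VLM head can emit actions as tokens. A minimal sketch of that idea, assuming 256 uniform bins over the range [-1, 1] (a common convention in prior VLA work; the paper does not specify these values):

```python
def discretize(action: float, low: float = -1.0, high: float = 1.0, n_bins: int = 256) -> int:
    """Map a continuous action value to a discrete bin index (token id)."""
    # Clamp to the valid range, then scale to [0, n_bins - 1].
    a = max(low, min(high, action))
    frac = (a - low) / (high - low)
    return min(n_bins - 1, int(frac * n_bins))

def undiscretize(idx: int, low: float = -1.0, high: float = 1.0, n_bins: int = 256) -> float:
    """Recover a continuous action from a bin index via the bin center."""
    return low + (idx + 0.5) * (high - low) / n_bins
```

Round-tripping through `discretize` and `undiscretize` incurs at most half a bin width of error, which is the usual trade-off this family of methods accepts in exchange for reusing the VLM's token-prediction machinery.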