🤖 AI Summary
This work investigates how vision-language-action (VLA) models can effectively inherit prior knowledge from large vision-language models (VLMs) to improve generalization in embodied control. Method: We introduce GrinningFace, a diagnostic benchmark that uses emojis (ubiquitous in Internet-scale VLM pre-training data yet largely absent from robotics datasets) as clean proxies, building emoji tabletop manipulation tasks in both simulated and real-robot environments to enable systematic evaluation of knowledge transfer. We comparatively analyze strategies including parameter-efficient fine-tuning, VLM freezing, co-training, and discrete and latent action prediction. Results: Experiments demonstrate that preserving VLM priors substantially improves VLA cross-task generalization, confirming their critical role in embodied intelligence. GrinningFace provides a reproducible, interpretable diagnostic framework for studying VLA knowledge transfer, offering both theoretical insight and practical guidance for building generalizable embodied AI systems.
📝 Abstract
Vision-language-action (VLA) models hold the promise of attaining generalizable embodied control. To achieve this, a pervasive paradigm is to leverage the rich vision-semantic priors of large vision-language models (VLMs). However, a fundamental question persists: how do VLAs effectively inherit the prior knowledge from VLMs? To address this critical question, we introduce a diagnostic benchmark, GrinningFace, an emoji tabletop manipulation task in which a robot arm is asked to place objects onto printed emojis corresponding to language instructions. This task design is particularly revealing -- knowledge associated with emojis is ubiquitous in the Internet-scale datasets used for VLM pre-training, yet emojis themselves are largely absent from standard robotics datasets. Consequently, they provide a clean proxy: successful task completion indicates effective transfer of VLM priors to embodied control. We implement this diagnostic task both in a simulated environment and on a real robot, and compare various promising techniques for knowledge transfer. Specifically, we investigate the effects of parameter-efficient fine-tuning, VLM freezing, co-training, predicting discretized actions, and predicting latent actions. Through systematic evaluation, our work not only demonstrates the critical importance of preserving VLM priors for the generalization of VLAs but also establishes guidelines for future research in developing truly generalizable embodied AI systems.
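One of the compared techniques, predicting discretized actions, maps each continuous action dimension onto a fixed vocabulary of bins so an autoregressive VLM head can emit actions as tokens. A minimal sketch of that idea, assuming 256 uniform bins over the range [-1, 1] (a common convention in prior VLA work; the paper does not specify these values):

```python
def discretize(action: float, low: float = -1.0, high: float = 1.0, n_bins: int = 256) -> int:
    """Map a continuous action value to a discrete bin index (token id)."""
    # Clamp to the valid range, then scale to [0, n_bins - 1].
    a = max(low, min(high, action))
    frac = (a - low) / (high - low)
    return min(n_bins - 1, int(frac * n_bins))

def undiscretize(idx: int, low: float = -1.0, high: float = 1.0, n_bins: int = 256) -> float:
    """Recover a continuous action from a bin index via the bin center."""
    return low + (idx + 0.5) * (high - low) / n_bins
```

Round-tripping through `discretize` and `undiscretize` incurs at most half a bin width of error, which is the usual trade-off this family of methods accepts in exchange for reusing the VLM's token-prediction machinery.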