AI Summary
This work addresses the limitations of existing vision-language-action (VLA) models, which are hindered by large parameter counts, high pretraining costs, and poor generalization across diverse robot embodiments, impeding real-world deployment. To overcome these challenges, we propose LLaVA-VLA, a lightweight end-to-end VLA system, and introduce CEBench, the first simulation-to-real benchmark designed for cross-embodiment generalization. Our approach leverages a compact vision-language backbone, multi-view and embodiment-aware tokenization, action chunking, and a two-stage training strategy (post-pretraining followed by fine-tuning) to unify navigation and manipulation action spaces without requiring massive-scale pretraining. Experiments demonstrate that LLaVA-VLA generalizes well across multiple robot embodiments and successfully executes real-world end-to-end mobile manipulation tasks. Code and data will be publicly released.
Abstract
Vision-Language-Action (VLA) models have emerged as generalist robotic agents. However, existing VLAs are hindered by excessive parameter scales, prohibitive pre-training requirements, and limited applicability to diverse embodiments. To improve the practicality of VLAs, we propose a comprehensive benchmark and an improved baseline. First, we propose CEBench, a new benchmark spanning diverse embodiments in both simulation and the real world, with domain randomization taken into account. We collect 14.4k simulated trajectories and 1.6k real-world expert-curated trajectories to support training on CEBench. Second, using CEBench as our testbed, we study three critical aspects of VLA practicality and offer several key findings. Informed by these findings, we introduce LLaVA-VLA, a lightweight yet powerful VLA designed for practical deployment on consumer-grade GPUs. Architecturally, it integrates a compact VLM backbone with multi-view perception, proprioceptive tokenization, and action chunking. To eliminate reliance on costly pre-training, LLaVA-VLA adopts a two-stage training paradigm consisting of post-training followed by fine-tuning. Furthermore, LLaVA-VLA extends the action space to unify navigation and manipulation. Experiments across embodiments demonstrate the generalization and versatility of LLaVA-VLA, while real-world mobile manipulation experiments establish it as the first end-to-end VLA model for mobile manipulation. We will open-source all datasets, code, and checkpoints upon acceptance to foster reproducibility and future research.
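To make the unified action space concrete, the sketch below illustrates one common way such a design can work: navigation (base) and manipulation (arm) commands are concatenated into a single action vector, discretized into per-dimension bins, and emitted as a chunk of several future steps at once. This is an illustrative sketch only, not the paper's implementation; the dimension layout, bin count, chunk size, and all function names are assumptions.

```python
import numpy as np

# Hypothetical layout (assumption, not from the paper): 2-D base velocity
# for navigation + 7-D arm command (e.g. 6-DoF end-effector delta + gripper).
BASE_DIMS, ARM_DIMS = 2, 7
ACTION_DIMS = BASE_DIMS + ARM_DIMS

NUM_BINS = 256                     # per-dimension discretization resolution
ACTION_LOW, ACTION_HIGH = -1.0, 1.0
CHUNK_SIZE = 8                     # number of future steps predicted at once

def discretize(actions: np.ndarray) -> np.ndarray:
    """Map continuous actions in [ACTION_LOW, ACTION_HIGH] to integer bin ids."""
    clipped = np.clip(actions, ACTION_LOW, ACTION_HIGH)
    scaled = (clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)
    return np.minimum((scaled * NUM_BINS).astype(int), NUM_BINS - 1)

def encode_chunk(chunk: np.ndarray) -> list:
    """Flatten a (CHUNK_SIZE, ACTION_DIMS) action chunk into one token list."""
    assert chunk.shape == (CHUNK_SIZE, ACTION_DIMS)
    return discretize(chunk).flatten().tolist()

def decode_chunk(tokens: list) -> np.ndarray:
    """Invert encode_chunk back to approximate continuous actions (bin centers)."""
    ids = np.array(tokens).reshape(CHUNK_SIZE, ACTION_DIMS)
    centers = (ids + 0.5) / NUM_BINS
    return centers * (ACTION_HIGH - ACTION_LOW) + ACTION_LOW
```

Because base and arm dimensions live in the same discretized vocabulary, a single autoregressive head can predict both, which is one plausible mechanism behind a unified navigation-and-manipulation action space; the round-trip error of this scheme is bounded by half a bin width per dimension.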