Rethinking the Practicality of Vision-language-action Model: A Comprehensive Benchmark and An Improved Baseline

📅 2026-02-26
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the limitations of existing vision-language-action (VLA) models, which are hindered by large parameter counts, high pre-training costs, and poor generalization across diverse robot embodiments, all of which impede real-world deployment. To overcome these challenges, we propose LLaVA-VLA, a lightweight end-to-end VLA system, and introduce CEBench, the first simulation-to-real benchmark designed for cross-embodiment generalization. Our approach leverages a compact vision-language backbone, multi-view and embodiment-aware tokenization, action chunking, and a two-stage training strategy (post-training followed by fine-tuning) to unify navigation and manipulation action spaces without requiring massive-scale pre-training. Experiments demonstrate that LLaVA-VLA achieves strong generalization across multiple robot embodiments and successfully executes real-world end-to-end mobile manipulation tasks. Code and data will be publicly released.
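To make the pipeline described above concrete, here is a minimal PyTorch sketch of such an architecture: multi-view visual features and a proprioceptive state are embedded as tokens, passed through a compact transformer backbone, and decoded into a fixed-length action chunk. All names and dimensions below (VLASketch, the 512-d patch features, the 4-layer encoder, the 7-d actions) are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class VLASketch(nn.Module):
    """Hypothetical sketch: multi-view + proprioceptive tokens -> action chunk."""

    def __init__(self, d_model=768, proprio_dim=14, chunk_len=8, action_dim=7):
        super().__init__()
        # Project per-view patch features (stand-in for the VLM's vision tower).
        self.visual = nn.Linear(512, d_model)          # assumed 512-d patch features
        # Embed the robot's proprioceptive state as one extra token.
        self.proprio = nn.Linear(proprio_dim, d_model)
        # Compact backbone (stand-in for the vision-language model).
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        # Action-chunking head: one learned query per future timestep.
        self.chunk_queries = nn.Parameter(torch.randn(chunk_len, d_model))
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, view_feats, proprio_state):
        # view_feats: (B, n_views * n_patches, 512); proprio_state: (B, proprio_dim)
        vis = self.visual(view_feats)
        prop = self.proprio(proprio_state).unsqueeze(1)
        q = self.chunk_queries.unsqueeze(0).expand(vis.size(0), -1, -1)
        out = self.backbone(torch.cat([vis, prop, q], dim=1))
        # Decode the trailing chunk queries into an action sequence.
        return self.action_head(out[:, -q.size(1):])   # (B, chunk_len, action_dim)

# Usage: feats = torch.randn(2, 3 * 196, 512); state = torch.randn(2, 14)
# chunk = VLASketch()(feats, state)   # -> torch.Size([2, 8, 7])
```

Executing a predicted chunk open-loop before re-querying the backbone is the usual motivation for action chunking: fewer inference calls per control step and temporally smoother actions.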

πŸ“ Abstract
Vision-Language-Action (VLA) models have emerged as generalist robotic agents. However, existing VLAs are hindered by excessive parameter scales, prohibitive pre-training requirements, and limited applicability to diverse embodiments. To improve the practicality of VLAs, we propose a comprehensive benchmark and an improved baseline. First, we propose CEBench, a new benchmark spanning diverse embodiments in both simulation and the real world, with domain randomization taken into account. We collect 14.4k simulated trajectories and 1.6k real-world expert-curated trajectories to support training on CEBench. Second, using CEBench as our testbed, we study three critical aspects of VLA practicality and offer several key findings. Informed by these findings, we introduce LLaVA-VLA, a lightweight yet powerful VLA designed for practical deployment on consumer-grade GPUs. Architecturally, it integrates a compact VLM backbone with multi-view perception, proprioceptive tokenization, and action chunking. To eliminate reliance on costly pre-training, LLaVA-VLA adopts a two-stage training paradigm comprising post-training and fine-tuning. Furthermore, LLaVA-VLA extends the action space to unify navigation and manipulation. Experiments across embodiments demonstrate the generalization and versatility of LLaVA-VLA, while real-world mobile manipulation experiments establish it as the first end-to-end VLA model for mobile manipulation. We will open-source all datasets, code, and checkpoints upon acceptance to foster reproducibility and future research.
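The abstract's unified navigation-and-manipulation action space can be pictured as one flat command vector shared by a single action head. Below is a minimal Python sketch under assumed dimensions (a 2-d base velocity, 6 arm joints, 1 gripper scalar); the names (UnifiedAction, to_vector) and field layout are illustrative assumptions, not the paper's actual parameterization.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class UnifiedAction:
    """Assumed layout: base motion and arm/gripper commands in one vector."""
    base_vel: np.ndarray    # (2,) linear and angular velocity (navigation)
    arm_joints: np.ndarray  # (6,) joint position deltas (manipulation)
    gripper: float          # scalar open/close command

    def to_vector(self) -> np.ndarray:
        """Flatten into the single 9-d vector one action head would predict."""
        return np.concatenate([self.base_vel, self.arm_joints, [self.gripper]])

    @classmethod
    def from_vector(cls, v: np.ndarray) -> "UnifiedAction":
        return cls(base_vel=v[:2], arm_joints=v[2:8], gripper=float(v[8]))

# Usage: UnifiedAction.from_vector(np.zeros(9)).to_vector()  # -> shape (9,)
```

Sharing one vector lets the same model drive both task families; a pure navigation step simply predicts near-zero arm components, and vice versa.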
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
practicality
embodiment generalization
parameter efficiency
pre-training cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action (VLA)
CEBench
LLaVA-VLA
lightweight architecture
end-to-end mobile manipulation
Wenxuan Song
The Hong Kong University of Science and Technology (Guangzhou)
Vision-language-action Model, Robotics
Jiayi Chen
The Chinese University of Hong Kong, Shenzhen
AI, Robotics, Control
Xiaoquan Sun
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China; Huazhong University of Science and Technology, Wuhan, China.
Huashuo Lei
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China.
Yikai Qin
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China.
Wei Zhao
Westlake University, Hangzhou, China.
Pengxiang Ding
Zhejiang University
Human Motion Prediction, Large Language Model, Embodied AI
Han Zhao
Zhejiang University | Westlake University
Embodied intelligence, Reinforcement learning, Multimodal large language models, Control theory
Tongxin Wang
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China.
Pengxu Hou
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China.
Zhide Zhong
Beijing Institute of Technology
Robotics
Haodong Yan
PhD student at INTR, HKUST (GZ)
Human reconstruction, motion prediction
Donglin Wang
Westlake University
Deep Reinforcement Learning, Meta Learning, Robot Learning
Jun Ma
Assistant Professor, The Hong Kong University of Science and Technology
Robotics, Autonomous Driving, Motion Planning and Control, Optimization
Haoang Li
Assistant Professor, Hong Kong University of Science and Technology (Guangzhou)
Robotics, 3D Computer Vision