VLA-Arena: An Open-Source Framework for Benchmarking Vision-Language-Action Models

📅 2025-12-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language-action (VLA) models currently lack a systematic, standardized capability benchmark. To address this gap, the authors propose VLA-Arena, the first open-source VLA benchmarking framework, built on a three-axis orthogonal difficulty model: task structure (levels L0-L2), language command (perturbations W0-W4), and visual observation (perturbations V0-V4). The framework includes the multi-scale VLA-Arena-S/M/L fine-tuning datasets and an end-to-end automated evaluation pipeline. Evaluating mainstream VLAs reveals critical systemic bottlenecks: strong memorization but poor generalization, an absence of safety constraints, and failure to compose learned skills over long horizons. The orthogonal axes enable fine-grained capability disentanglement and robustness attribution analysis. All components, including code, data, models, and a live leaderboard, are publicly released to foster reproducible, transparent VLA research.

📝 Abstract
While Vision-Language-Action models (VLAs) are rapidly advancing towards generalist robot policies, it remains difficult to quantitatively understand their limits and failure modes. To address this, we introduce a comprehensive benchmark called VLA-Arena. We propose a novel structured task design framework to quantify difficulty across three orthogonal axes: (1) Task Structure, (2) Language Command, and (3) Visual Observation. This allows us to systematically design tasks with fine-grained difficulty levels, enabling a precise measurement of model capability frontiers. For Task Structure, VLA-Arena's 170 tasks are grouped into four dimensions: Safety, Distractor, Extrapolation, and Long Horizon. Each task is designed with three difficulty levels (L0-L2), with fine-tuning performed exclusively on L0 to assess general capability. Orthogonal to this, language (W0-W4) and visual (V0-V4) perturbations can be applied to any task to enable a decoupled analysis of robustness. Our extensive evaluation of state-of-the-art VLAs reveals several critical limitations, including a strong tendency toward memorization over generalization, asymmetric robustness, a lack of consideration for safety constraints, and an inability to compose learned skills for long-horizon tasks. To foster research addressing these challenges and ensure reproducibility, we provide the complete VLA-Arena framework, including an end-to-end toolchain from task definition to automated evaluation and the VLA-Arena-S/M/L datasets for fine-tuning. Our benchmark, data, models, and leaderboard are available at https://vla-arena.github.io.
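The three orthogonal axes described above can be pictured as a small configuration space per task. The sketch below is purely illustrative, based only on the level names given in the abstract (L0-L2, W0-W4, V0-V4); the class and function names are hypothetical and not VLA-Arena's actual API.

```python
from dataclasses import dataclass
from itertools import product

# Illustrative encoding of the three-axis difficulty scheme; names are
# assumptions, not taken from the VLA-Arena codebase.
STRUCTURE_LEVELS = ["L0", "L1", "L2"]                     # task structure
LANGUAGE_PERTURBATIONS = ["W0", "W1", "W2", "W3", "W4"]   # language command
VISUAL_PERTURBATIONS = ["V0", "V1", "V2", "V3", "V4"]     # visual observation

@dataclass(frozen=True)
class TaskCondition:
    structure: str  # L0-L2; fine-tuning uses L0 only
    language: str   # W0-W4
    visual: str     # V0-V4

def all_conditions():
    """Enumerate every axis combination for one task."""
    return [TaskCondition(s, w, v)
            for s, w, v in product(STRUCTURE_LEVELS,
                                   LANGUAGE_PERTURBATIONS,
                                   VISUAL_PERTURBATIONS)]

conds = all_conditions()
print(len(conds))  # 3 * 5 * 5 = 75 evaluation conditions per task
```

Because the axes are orthogonal, holding two fixed while sweeping the third isolates one source of difficulty, which is what enables the decoupled robustness analysis the abstract describes.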
Problem

Research questions and friction points this paper is trying to address.

Benchmarking Vision-Language-Action models' limits and failure modes
Systematically measuring model capabilities across structured task difficulties
Assessing robustness to language and visual perturbations in tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Structured task design framework for difficulty quantification
Orthogonal axes for decoupled robustness analysis
Comprehensive benchmark with fine-grained difficulty levels
Authors

Borong Zhang — University of Macau
Jiahao Li — Institute for Artificial Intelligence, Peking University
Jiachen Shen — University of Science and Technology Beijing
Yishuai Cai — Institute for Artificial Intelligence, Peking University
Yuhao Zhang — Institute for Artificial Intelligence, Peking University
Yuanpei Chen — South China University of Technology
Juntao Dai — Beijing Academy of Artificial Intelligence
Jiaming Ji — Institute for Artificial Intelligence, Peking University
Yaodong Yang — Institute for Artificial Intelligence, Peking University