VLA-Arena: An Open-Source Framework for Benchmarking Vision-Language-Action Models

📅 2025-12-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language-action (VLA) models currently lack a systematic, standardized capability benchmark. To address this gap, the authors propose VLA-Arena, the first open-source VLA benchmarking framework, built on a three-axis orthogonal difficulty model: task structure (levels L0-L2), language command (perturbations W0-W4), and visual observation (perturbations V0-V4). The framework includes the multi-scale VLA-Arena-S/M/L fine-tuning datasets and an end-to-end automated evaluation pipeline. Evaluating mainstream VLAs reveals critical systemic bottlenecks: strong memorization but poor generalization, an absence of safety constraints, and failure to compose learned skills over long horizons. The orthogonal axes enable fine-grained capability disentanglement and robustness attribution analysis. All components, including code, data, models, and a live leaderboard, are publicly released to foster reproducible, transparent VLA research.

📝 Abstract
While Vision-Language-Action models (VLAs) are rapidly advancing towards generalist robot policies, it remains difficult to quantitatively understand their limits and failure modes. To address this, we introduce a comprehensive benchmark called VLA-Arena. We propose a novel structured task design framework to quantify difficulty across three orthogonal axes: (1) Task Structure, (2) Language Command, and (3) Visual Observation. This allows us to systematically design tasks with fine-grained difficulty levels, enabling a precise measurement of model capability frontiers. For Task Structure, VLA-Arena's 170 tasks are grouped into four dimensions: Safety, Distractor, Extrapolation, and Long Horizon. Each task is designed with three difficulty levels (L0-L2), with fine-tuning performed exclusively on L0 to assess general capability. Orthogonal to this, language (W0-W4) and visual (V0-V4) perturbations can be applied to any task to enable a decoupled analysis of robustness. Our extensive evaluation of state-of-the-art VLAs reveals several critical limitations, including a strong tendency toward memorization over generalization, asymmetric robustness, a lack of consideration for safety constraints, and an inability to compose learned skills for long-horizon tasks. To foster research addressing these challenges and ensure reproducibility, we provide the complete VLA-Arena framework, including an end-to-end toolchain from task definition to automated evaluation and the VLA-Arena-S/M/L datasets for fine-tuning. Our benchmark, data, models, and leaderboard are available at https://vla-arena.github.io.
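The three orthogonal axes described above can be pictured as a small configuration space per task. The sketch below is purely illustrative, based only on the level names given in the abstract (L0-L2, W0-W4, V0-V4); the class and function names are hypothetical and not VLA-Arena's actual API.

```python
from dataclasses import dataclass
from itertools import product

# Illustrative encoding of the three-axis difficulty scheme; names are
# assumptions, not taken from the VLA-Arena codebase.
STRUCTURE_LEVELS = ["L0", "L1", "L2"]                     # task structure
LANGUAGE_PERTURBATIONS = ["W0", "W1", "W2", "W3", "W4"]   # language command
VISUAL_PERTURBATIONS = ["V0", "V1", "V2", "V3", "V4"]     # visual observation

@dataclass(frozen=True)
class TaskCondition:
    structure: str  # L0-L2; fine-tuning uses L0 only
    language: str   # W0-W4
    visual: str     # V0-V4

def all_conditions():
    """Enumerate every axis combination for one task."""
    return [TaskCondition(s, w, v)
            for s, w, v in product(STRUCTURE_LEVELS,
                                   LANGUAGE_PERTURBATIONS,
                                   VISUAL_PERTURBATIONS)]

conds = all_conditions()
print(len(conds))  # 3 * 5 * 5 = 75 evaluation conditions per task
```

Because the axes are orthogonal, holding two fixed while sweeping the third isolates one source of difficulty, which is what enables the decoupled robustness analysis the abstract describes.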
Problem

Research questions and friction points this paper is trying to address.

Benchmarking Vision-Language-Action models' limits and failure modes
Systematically measuring model capabilities across structured task difficulties
Assessing robustness to language and visual perturbations in tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Structured task design framework for difficulty quantification
Orthogonal axes for decoupled robustness analysis
Comprehensive benchmark with fine-grained difficulty levels
Authors

Borong Zhang — University of Macau
Jiahao Li — Institute for Artificial Intelligence, Peking University
Jiachen Shen — University of Science and Technology Beijing
Yishuai Cai — Institute for Artificial Intelligence, Peking University
Yuhao Zhang — Institute for Artificial Intelligence, Peking University
Yuanpei Chen — South China University of Technology
Juntao Dai — Beijing Academy of Artificial Intelligence
Jiaming Ji — Institute for Artificial Intelligence, Peking University
Yaodong Yang — Institute for Artificial Intelligence, Peking University