NEBULA: Do We Evaluate Vision-Language-Action Agents Correctly?

📅 2025-10-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current evaluation of Vision-Language-Action (VLA) agents faces two key bottlenecks: (1) overreliance on coarse terminal task success rates, which impedes fine-grained diagnosis of specific capability deficits and robustness assessment under realistic perturbations; and (2) fragmented, nonstandardized benchmark data and interfaces, hindering reproducible research and generalizable model development. To address these, we propose NEBULA—a unified evaluation ecosystem for single-arm manipulation—introducing the first dual-axis protocol integrating *capability testing* (e.g., spatial reasoning, dynamic adaptation) and *stress testing* (e.g., illumination, viewpoint, occlusion variations) for precise, interpretable diagnostics. NEBULA provides standardized APIs and a large-scale, aggregated dataset enabling cross-dataset training and fair, apples-to-apples comparison. Experiments reveal significant deficiencies in core embodied capabilities across state-of-the-art VLA models, empirically validating NEBULA’s diagnostic depth and evaluation reliability.

📝 Abstract
The evaluation of Vision-Language-Action (VLA) agents is hindered by coarse, end-task success metrics that fail to provide precise skill diagnosis or measure robustness to real-world perturbations. This challenge is exacerbated by a fragmented data landscape that impedes reproducible research and the development of generalist models. To address these limitations, we introduce **NEBULA**, a unified ecosystem for single-arm manipulation that enables diagnostic and reproducible evaluation. NEBULA features a novel dual-axis evaluation protocol that combines fine-grained *capability tests* for precise skill diagnosis with systematic *stress tests* that measure robustness. A standardized API and a large-scale, aggregated dataset are provided to reduce fragmentation and support cross-dataset training and fair comparison. Using NEBULA, we demonstrate that top-performing VLAs struggle with key capabilities such as spatial reasoning and dynamic adaptation, which are consistently obscured by conventional end-task success metrics. By measuring both what an agent can do and when it does so reliably, NEBULA provides a practical foundation for robust, general-purpose embodied agents.
Problem

Research questions and friction points this paper is trying to address.

Evaluating Vision-Language-Action agents with coarse success metrics
Addressing fragmented data landscape hindering reproducible research
Measuring agent robustness to real-world perturbations systematically
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified ecosystem for diagnostic manipulation evaluation
Dual-axis protocol combining capability and stress tests
Standardized API with aggregated dataset reducing fragmentation
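To make the dual-axis idea concrete, here is a minimal, hypothetical sketch of what such an evaluation loop could look like. All names (`CAPABILITY_TESTS`, `STRESS_TESTS`, `run_episode`, the task strings) are illustrative assumptions, not NEBULA's actual API:

```python
# Hypothetical sketch of a dual-axis evaluation loop in the spirit of
# NEBULA's protocol. Every identifier here is illustrative, not the
# paper's real interface.

from collections import defaultdict

# Axis 1: fine-grained capability tests, grouped by skill.
CAPABILITY_TESTS = {
    "spatial_reasoning": ["stack_by_relation", "place_left_of"],
    "dynamic_adaptation": ["moving_target_grasp"],
}
# Axis 2: systematic stress tests, grouped by perturbation factor.
STRESS_TESTS = {
    "illumination": ["dim", "flicker"],
    "viewpoint": ["shifted_camera"],
    "occlusion": ["partial_occluder"],
}

def run_episode(agent, task, perturbation=None):
    """Placeholder rollout: returns 1.0 on success, 0.0 on failure."""
    return agent(task, perturbation)

def evaluate(agent):
    report = {"capability": defaultdict(list), "stress": defaultdict(list)}
    # Capability axis: probe each skill under nominal conditions.
    for skill, tasks in CAPABILITY_TESTS.items():
        for task in tasks:
            report["capability"][skill].append(run_episode(agent, task))
    # Stress axis: the same agent under controlled perturbations.
    for factor, variants in STRESS_TESTS.items():
        for v in variants:
            report["stress"][factor].append(
                run_episode(agent, "pick_place", perturbation=v))
    # Per-group means expose *which* capability or condition fails,
    # unlike a single terminal success rate.
    return {axis: {k: sum(v) / len(v) for k, v in groups.items()}
            for axis, groups in report.items()}

# Toy agent that always succeeds, just to show the report shape.
scores = evaluate(lambda task, perturbation=None: 1.0)
print(scores["capability"]["spatial_reasoning"])  # 1.0
```

The point of the two-level report is that a model can score well on terminal success while one row (say, `occlusion`) collapses, which is exactly the kind of deficit the paper argues a single success rate hides.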
Authors

Jierui Peng, Department of Computer & Data Sciences, Case Western Reserve University
Yanyan Zhang, Department of Computer & Data Sciences, Case Western Reserve University
Yicheng Duan, Case Western Reserve University
Tuo Liang, Case Western Reserve University
Vipin Chaudhary, Case Western Reserve University
Yu Yin, Department of Computer & Data Sciences, Case Western Reserve University