Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines

📅 2026-04-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work identifies the lack of high-quality, scalable data infrastructure as a fundamental bottleneck in Vision-Language-Action (VLA) research. The study systematically reviews existing datasets, evaluation benchmarks, and data engines, offering a comparative analysis structured around embodiment diversity, multimodal alignment, task complexity, and physical realism. It uncovers critical evaluation gaps, particularly in long-horizon reasoning, compositional generalization, and sim-to-real transfer, and argues for the co-design of high-fidelity data engines and structured evaluation protocols. The paper further distills four key open challenges (representation alignment, multimodal supervision, reasoning evaluation, and scalable data generation), thereby charting a data-centric path forward for VLA research.

📝 Abstract

Despite remarkable progress in Vision-Language-Action (VLA) models, a central bottleneck remains underexamined: the data infrastructure that underlies embodied learning. In this survey, we argue that future advances in VLA will depend less on model architecture and more on the co-design of high-fidelity data engines and structured evaluation protocols. To this end, we present a systematic, data-centric analysis of VLA research organized around three pillars: datasets, benchmarks, and data engines. For datasets, we categorize real-world and synthetic corpora along embodiment diversity, modality composition, and action space formulation, revealing a persistent fidelity-cost trade-off that fundamentally constrains large-scale collection. For benchmarks, we analyze task complexity and environment structure jointly, exposing structural gaps in compositional generalization and long-horizon reasoning evaluation that existing protocols fail to address. For data engines, we examine simulation-based, video-reconstruction, and automated task-generation paradigms, identifying their shared limitations in physical grounding and sim-to-real transfer. Synthesizing these analyses, we distill four open challenges: representation alignment, multimodal supervision, reasoning assessment, and scalable data generation. Addressing them, we argue, requires treating data infrastructure as a first-class research problem rather than a background concern.

Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action

data infrastructure

embodied learning

VLA benchmarks

scalable data generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action

data-centric analysis

data engines