An Anatomy of Vision-Language-Action Models: From Modules to Milestones and Challenges

📅 2025-12-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
The rapid advancement of Vision-Language-Action (VLA) models in embodied intelligence lacks systematic organization and critical synthesis. Method: We conduct a comprehensive literature review, taxonomic analysis, and technical roadmap construction to identify five core research frontiers (representation, execution, generalization, safety, and data & evaluation) and propose the first structured "module–milestone–challenge" analytical framework, aligned with the evolutionary trajectory of general-purpose embodied agents. Contribution/Results: This work delivers a pedagogically structured, research-forward survey of embodied AI intended to support community discussion of evaluation benchmarks and data curation standards, and to enable dynamic knowledge accumulation via a continuously updated online platform.

📝 Abstract
Vision-Language-Action (VLA) models are driving a revolution in robotics, enabling machines to understand instructions and interact with the physical world. The field is expanding rapidly with new models and datasets, making it both exciting and challenging to keep pace with. This survey offers a clear and structured guide to the VLA landscape. We design it to follow the natural learning path of a researcher: we start with the basic Modules of any VLA model, trace the history through key Milestones, and then dive deep into the core Challenges that define the current research frontier. Our main contribution is a detailed breakdown of the five biggest challenges: (1) Representation, (2) Execution, (3) Generalization, (4) Safety, and (5) Dataset and Evaluation. This structure mirrors the developmental roadmap of a generalist agent: establishing the fundamental perception-action loop, scaling capabilities across diverse embodiments and environments, and finally ensuring trustworthy deployment, all supported by the essential data infrastructure. For each challenge, we review existing approaches and highlight future opportunities. We position this paper as both a foundational guide for newcomers and a strategic roadmap for experienced researchers, with the dual aim of accelerating learning and inspiring new ideas in embodied intelligence. A live version of this survey, with continuous updates, is maintained on our project page (https://suyuz1.github.io/Survery/).
Problem

Research questions and friction points this paper is trying to address.

Surveying Vision-Language-Action models' architecture, evolution, and research challenges
Analyzing five core challenges in representation, execution, generalization, safety, and evaluation
Providing a structured guide for researchers to understand and advance embodied intelligence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Survey structured by modules, milestones, and challenges
Detailed breakdown of five core VLA research challenges
Provides a foundational guide and strategic roadmap for researchers
Chao Xu
IROOTECH TECHNOLOGY
Suyu Zhang
IROOTECH TECHNOLOGY
Yang Liu
Department of Engineering, King’s College London
Baigui Sun
Wolf 1069 b Lab, Sany Group
Artificial Intelligence, Computer Vision
Weihong Chen
IROOTECH TECHNOLOGY
Bo Xu
IROOTECH TECHNOLOGY
Qi Liu
IROOTECH TECHNOLOGY
Juncheng Wang
Hong Kong Polytechnic University
Shujun Wang
The Hong Kong Polytechnic University
AI for Healthcare, Smart Ageing, AI for Science
Shan Luo
Reader (Associate Professor), King's College London
Robotics, Robot Perception, Tactile Sensing, Computer Vision, Machine Learning
Jan Peters
Computer Science Department of the Technische Universität Darmstadt
Athanasios V. Vasilakos
Department of ICT and Center for AI Research, University of Agder (UiA)
Stefanos Zafeiriou
Professor, Imperial College London
Computer Vision, Deep Learning, Statistical Machine Learning, Pattern Recognition, Biometrics
Jiankang Deng
Imperial College London
Computer Vision, Machine Learning