A Survey on Vision-Language-Action Models for Embodied AI

📅 2024-05-23
🏛️ arXiv.org
📈 Citations: 18 · Influential: 1
🤖 AI Summary
This paper addresses the core question of how Vision-Language-Action (VLA) models support language-conditioned robotic tasks in embodied AI. Methodologically, it introduces the first systematic survey framework for VLAs, proposing a three-part taxonomy (component design, low-level action policies, high-level task planning) that unifies VLA modeling, embodied control, task decomposition, simulation integration, and cross-benchmark evaluation. Key contributions include: (1) the first explicit characterization of the three principal VLA technical paradigms; (2) a comprehensive survey of multimodal datasets, embodied simulation platforms, and standardized evaluation benchmarks; and (3) a structured overview of critical open challenges, including scalable architecture design, world-model integration, and real-world deployment, along with promising future research directions.

📝 Abstract
Embodied AI is widely recognized as a key element of artificial general intelligence because it involves controlling embodied agents to perform tasks in the physical world. Building on the success of large language models and vision-language models, a new category of multimodal models -- referred to as vision-language-action models (VLAs) -- has emerged to address language-conditioned robotic tasks in embodied AI by leveraging their distinct ability to generate actions. In recent years, a myriad of VLAs have been developed, making it imperative to capture the rapidly evolving landscape through a comprehensive survey. To this end, we present the first survey on VLAs for embodied AI. This work provides a detailed taxonomy of VLAs, organized into three major lines of research. The first line focuses on individual components of VLAs. The second line is dedicated to developing control policies adept at predicting low-level actions. The third line comprises high-level task planners capable of decomposing long-horizon tasks into a sequence of subtasks, thereby guiding VLAs to follow more general user instructions. Furthermore, we provide an extensive summary of relevant resources, including datasets, simulators, and benchmarks. Finally, we discuss the challenges faced by VLAs and outline promising future directions in embodied AI.
Problem

Research questions and friction points the paper addresses.

Surveying vision-language-action models for embodied AI tasks.
Classifying VLAs into components, control policies, and task planners.
Identifying challenges and future directions in embodied AI research.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-language-action models for embodied AI
Control policies predicting low-level actions
High-level task planners for long-horizon tasks (see the sketch below)
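
To make the surveyed hierarchy concrete, here is a minimal, hypothetical Python sketch of how a high-level task planner might decompose a long-horizon instruction into subtasks and hand each one to a low-level VLA control policy. All class and method names (TaskPlanner, VLAPolicy, run_episode, etc.) are illustrative assumptions for this page, not an API from the paper or any specific VLA system.

# Hypothetical sketch of the planner/policy hierarchy described in the survey.
# All names and signatures are illustrative; no specific VLA system is implied.
from dataclasses import dataclass
from typing import List
import numpy as np


@dataclass
class Observation:
    rgb: np.ndarray          # camera image, e.g. shape (224, 224, 3)
    proprio: np.ndarray      # joint positions / gripper state


class VLAPolicy:
    """Low-level control: maps (observation, subtask text) to a continuous action."""

    def act(self, obs: Observation, subtask: str) -> np.ndarray:
        # A real VLA would run a vision-language backbone here; we return
        # a zero 7-DoF action (6 end-effector deltas + gripper) as a placeholder.
        return np.zeros(7)


class TaskPlanner:
    """High-level planning: decomposes a long-horizon instruction into subtasks."""

    def plan(self, instruction: str) -> List[str]:
        # A real planner might prompt an LLM; here one example is hard-coded.
        if "make coffee" in instruction:
            return ["pick up the mug", "place mug under dispenser", "press brew button"]
        return [instruction]  # fall back to treating the instruction as one subtask


def run_episode(instruction: str, planner: TaskPlanner, policy: VLAPolicy,
                steps_per_subtask: int = 5) -> None:
    obs = Observation(rgb=np.zeros((224, 224, 3)), proprio=np.zeros(8))
    for subtask in planner.plan(instruction):
        print(f"subtask: {subtask}")
        for _ in range(steps_per_subtask):
            action = policy.act(obs, subtask)
            # In a real loop, the simulator or robot would execute `action`
            # and return a new observation.


run_episode("make coffee", TaskPlanner(), VLAPolicy())

The design point the sketch illustrates is the division of labor the survey's taxonomy draws: the planner operates over language (instruction in, subtasks out), while the VLA policy grounds each subtask in perception and motor control.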
🔎 Similar Papers
2024-05-14 · IEEE/RSJ International Conference on Intelligent Robots and Systems · Citations: 2