A Survey on Efficient Vision-Language-Action Models

📅 2025-10-27
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Vision-Language-Action (VLA) models exhibit strong generalization capabilities but face significant deployment barriers in embodied AI due to prohibitive computational costs and large-scale data requirements. To address this, we propose the “Efficient VLA” research framework—a systematic, unified taxonomy spanning model architecture design, training optimization, and robot-centric data acquisition. Methodologically, we integrate lightweight architectures, model compression techniques, sample- and compute-efficient training strategies, and principled approaches for high-yield robotic data collection and utilization. We comprehensively survey state-of-the-art advances, distill representative application paradigms, and identify key challenges including scalability, out-of-distribution generalization, and data bias. Furthermore, we establish a continuously updated open-source project page. This work provides foundational theoretical insights and practical engineering guidance for developing computationally efficient, deployable embodied AI systems.
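Among the efficiency pillars the summary names, model compression is the most concrete. As a minimal illustrative sketch (not taken from the paper), the snippet below shows symmetric per-tensor int8 weight quantization with NumPy, one of the standard compression techniques surveys in this area cover; the function names and the 4x storage-reduction figure are this sketch's own, not the survey's.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w ~ scale * q."""
    scale = np.abs(w).max() / 127.0          # map the largest weight to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 tensor from the int8 codes."""
    return q.astype(np.float32) * scale

# Toy weight matrix standing in for one layer of a VLA backbone.
rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)

q, s = quantize_int8(w)
err = np.abs(w - dequantize(q, s)).mean()

# int8 stores 1 byte per weight vs 4 bytes for float32.
print(f"storage: {q.nbytes} bytes (int8) vs {w.nbytes} bytes (fp32), mean abs error {err:.4f}")
```

Real systems typically add per-channel scales, activation quantization, and calibration data, but the storage-versus-accuracy trade-off is the same idea.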

📝 Abstract
Vision-Language-Action models (VLAs) represent a significant frontier in embodied intelligence, aiming to bridge digital knowledge with physical-world interaction. While these models have demonstrated remarkable generalist capabilities, their deployment is severely hampered by the substantial computational and data requirements inherent to their underlying large-scale foundation models. Motivated by the urgent need to address these challenges, this survey presents the first comprehensive review of Efficient Vision-Language-Action models (Efficient VLAs) across the entire data-model-training process. Specifically, we introduce a unified taxonomy to systematically organize the disparate efforts in this domain, categorizing current techniques into three core pillars: (1) Efficient Model Design, focusing on efficient architectures and model compression; (2) Efficient Training, which reduces computational burdens during model learning; and (3) Efficient Data Collection, which addresses the bottlenecks in acquiring and utilizing robotic data. Through a critical review of state-of-the-art methods within this framework, this survey not only establishes a foundational reference for the community but also summarizes representative applications, delineates key challenges, and charts a roadmap for future research. We maintain a continuously updated project page to track our latest developments: https://evla-survey.github.io/
Problem

Research questions and friction points this paper is trying to address.

Surveying efficient Vision-Language-Action models for embodied intelligence
Addressing computational and data bottlenecks in VLA deployment
Organizing efficient techniques across data, model, and training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Efficient model design via lightweight architectures and model compression
Efficient training methods that reduce computational burdens during learning
Efficient data collection that addresses robotic data acquisition bottlenecks
Authors
Zhaoshu Yu — School of Computer Science and Technology, Tongji University, China
Bo Wang — School of Computer Science and Technology, Tongji University, China
Pengpeng Zeng — Tongji University (computer vision)
Haonan Zhang — School of Computer Science and Technology, Tongji University, China
Ji Zhang — School of Computing and Artificial Intelligence, Southwest Jiaotong University, China
Lianli Gao — UESTC (vision and language)
Jingkuan Song — School of Computer Science and Technology, Tongji University, China
Nicu Sebe — University of Trento (computer vision, multimedia)
Heng Tao Shen — School of Computer Science and Technology, Tongji University, China