A Survey on Efficient Vision-Language-Action Models

📅 2025-10-27
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Vision-Language-Action (VLA) models exhibit strong generalization capabilities but face significant deployment barriers in embodied AI due to prohibitive computational costs and large-scale data requirements. To address this, we propose the “Efficient VLA” research framework—a systematic, unified taxonomy spanning model architecture design, training optimization, and robot-centric data acquisition. Methodologically, we integrate lightweight architectures, model compression techniques, sample- and compute-efficient training strategies, and principled approaches for high-yield robotic data collection and utilization. We comprehensively survey state-of-the-art advances, distill representative application paradigms, and identify key challenges including scalability, out-of-distribution generalization, and data bias. Furthermore, we establish a continuously updated open-source project page. This work provides foundational theoretical insights and practical engineering guidance for developing computationally efficient, deployable embodied AI systems.
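Among the efficiency pillars the summary names, model compression is the most concrete. As a minimal illustrative sketch (not taken from the paper), the snippet below shows symmetric per-tensor int8 weight quantization with NumPy, one of the standard compression techniques surveys in this area cover; the function names and the 4x storage-reduction figure are this sketch's own, not the survey's.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w ~ scale * q."""
    scale = np.abs(w).max() / 127.0          # map the largest weight to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 tensor from the int8 codes."""
    return q.astype(np.float32) * scale

# Toy weight matrix standing in for one layer of a VLA backbone.
rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)

q, s = quantize_int8(w)
err = np.abs(w - dequantize(q, s)).mean()

# int8 stores 1 byte per weight vs 4 bytes for float32.
print(f"storage: {q.nbytes} bytes (int8) vs {w.nbytes} bytes (fp32), mean abs error {err:.4f}")
```

Real systems typically add per-channel scales, activation quantization, and calibration data, but the storage-versus-accuracy trade-off is the same idea.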

📝 Abstract
Vision-Language-Action models (VLAs) represent a significant frontier in embodied intelligence, aiming to bridge digital knowledge with physical-world interaction. While these models have demonstrated remarkable generalist capabilities, their deployment is severely hampered by the substantial computational and data requirements inherent to their underlying large-scale foundation models. Motivated by the urgent need to address these challenges, this survey presents the first comprehensive review of Efficient Vision-Language-Action models (Efficient VLAs) across the entire data-model-training process. Specifically, we introduce a unified taxonomy to systematically organize the disparate efforts in this domain, categorizing current techniques into three core pillars: (1) Efficient Model Design, focusing on efficient architectures and model compression; (2) Efficient Training, which reduces computational burdens during model learning; and (3) Efficient Data Collection, which addresses the bottlenecks in acquiring and utilizing robotic data. Through a critical review of state-of-the-art methods within this framework, this survey not only establishes a foundational reference for the community but also summarizes representative applications, delineates key challenges, and charts a roadmap for future research. We maintain a continuously updated project page to track our latest developments: https://evla-survey.github.io/
Problem

Research questions and friction points this paper is trying to address.

Surveying efficient Vision-Language-Action models for embodied intelligence
Addressing computational and data bottlenecks in VLA deployment
Organizing efficient techniques across data, model, and training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Efficient model design via lightweight architectures and model compression
Efficient training methods that reduce computational burdens during learning
Efficient data collection that addresses robotic data acquisition bottlenecks
Authors
Zhaoshu Yu — School of Computer Science and Technology, Tongji University, China
Bo Wang — School of Computer Science and Technology, Tongji University, China
Pengpeng Zeng — Tongji University (computer vision)
Haonan Zhang — School of Computer Science and Technology, Tongji University, China
Ji Zhang — School of Computing and Artificial Intelligence, Southwest Jiaotong University, China
Lianli Gao — UESTC (vision and language)
Jingkuan Song — School of Computer Science and Technology, Tongji University, China
Nicu Sebe — University of Trento (computer vision, multimedia)
Heng Tao Shen — School of Computer Science and Technology, Tongji University, China