A Survey on Vision-Language-Action Models for Autonomous Driving

📅 2025-06-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: The autonomous driving community lacks a systematic survey of Vision-Language-Action (VLA) models: the existing literature is fragmented, the architectural evolution is unclear, benchmarking is inconsistent, and critical challenges such as robustness, real-time inference, and formal verification remain unaddressed in an integrated manner. Method: We propose VLA4AD, the first unified framework for VLA in autonomous driving, formally modeling its core components and articulating an "explain → reason → plan" evolutionary pathway. We construct a comparative taxonomy spanning more than 20 models, integrating multimodal foundation models, end-to-end policy learning, and driving-scene reasoning techniques, and we design a comprehensive evaluation protocol that balances safety, accuracy, and explainability. Contribution/Results: We release an open-source literature review and resource repository, establishing a theoretical foundation and a practical roadmap for interpretable, socially aligned, VLA-driven autonomous driving research.

📝 Abstract
The rapid progress of multimodal large language models (MLLMs) has paved the way for Vision-Language-Action (VLA) paradigms, which integrate visual perception, natural language understanding, and control within a single policy. Researchers in autonomous driving are actively adapting these methods to the vehicle domain. Such models promise autonomous vehicles that can interpret high-level instructions, reason about complex traffic scenes, and make their own decisions. However, the literature remains fragmented and is rapidly expanding. This survey offers the first comprehensive overview of VLA for Autonomous Driving (VLA4AD). We (i) formalize the architectural building blocks shared across recent work, (ii) trace the evolution from early explainer models to reasoning-centric VLA models, and (iii) compare over 20 representative models according to VLA's progress in the autonomous driving domain. We also consolidate existing datasets and benchmarks, highlighting protocols that jointly measure driving safety, accuracy, and explanation quality. Finally, we detail open challenges, including robustness, real-time efficiency, and formal verification, and outline future directions for VLA4AD. This survey provides a concise yet complete reference for advancing interpretable, socially aligned autonomous vehicles. The GitHub repo is available at https://github.com/JohnsonJiang1996/Awesome-VLA4AD.
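
To make the VLA interface concrete, here is a minimal, hypothetical Python sketch of a single policy that maps camera frames and a natural-language instruction to a driving action plus a textual rationale. The class, module, and field names below are illustrative assumptions, not APIs from any of the surveyed models.

```python
# Minimal sketch of the VLA-for-driving interface described above: a single
# policy mapping camera frames plus a natural-language instruction to a
# low-level driving action. All names here are illustrative assumptions,
# not taken from the surveyed papers.
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class DrivingAction:
    steer: float      # normalized steering angle in [-1, 1]
    throttle: float   # normalized throttle in [0, 1]
    brake: float      # normalized brake in [0, 1]
    rationale: str    # natural-language explanation ("explain -> reason -> plan")


class VLADrivingPolicy:
    """Single policy combining vision, language, and action (illustrative only)."""

    def __init__(self, vision_encoder, language_model, action_head):
        self.vision_encoder = vision_encoder  # e.g. a ViT producing scene tokens
        self.language_model = language_model  # e.g. an MLLM reasoning over tokens
        self.action_head = action_head        # maps the reasoning state to controls

    def act(self, camera_frames: List[np.ndarray], instruction: str) -> DrivingAction:
        scene_tokens = self.vision_encoder(camera_frames)
        # Assumed to return an object with a latent `.state` and a textual `.text`.
        reasoning = self.language_model(scene_tokens, instruction)
        steer, throttle, brake = self.action_head(reasoning.state)
        return DrivingAction(steer, throttle, brake, rationale=reasoning.text)
```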
Problem

Research questions and friction points this paper is trying to address.

Surveying Vision-Language-Action models for autonomous driving
Comparing over 20 representative models in the autonomous driving domain
Addressing open challenges such as robustness, real-time efficiency, and formal verification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates vision, language, and action within a single policy
Compares over 20 representative VLA models
Highlights protocols that jointly measure driving safety, accuracy, and explanation quality (see the sketch below)
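
The joint evaluation idea can be illustrated with a small, hypothetical scoring function. The axes, weights, and aggregation below are assumptions made for illustration, not the survey's official protocol.

```python
# Hypothetical sketch of a joint evaluation in the spirit the survey describes:
# scoring driving safety, accuracy, and explanation quality together.
# Weights and aggregation are illustrative assumptions only.
def vla4ad_composite_score(safety: float, accuracy: float, explanation: float,
                           weights=(0.5, 0.3, 0.2)) -> float:
    """Weighted aggregate of per-axis scores, each already normalized to [0, 1]."""
    w_safety, w_accuracy, w_explanation = weights
    return (w_safety * safety
            + w_accuracy * accuracy
            + w_explanation * explanation)


# Example: a model that drives safely but gives weak explanations.
score = vla4ad_composite_score(safety=0.95, accuracy=0.88, explanation=0.60)
print(f"composite score: {score:.3f}")  # -> 0.859
```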