DriveAction: A Benchmark for Exploring Human-like Driving Decisions in VLA Models

📅 2025-06-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing benchmarks for Vision-Language-Action (VLA) models in autonomous driving face critical limitations: insufficient scenario diversity, unreliable action annotations, and misalignment between automated metrics and human preferences. To address these, the authors introduce DriveAction, the first action-driven benchmark designed for VLA models, comprising 16,185 QA pairs across 2,610 realistic driving scenarios, with high-level discrete action labels derived from the actual driving operations of production-vehicle users. Contributions include: (1) an action-rooted evaluation paradigm; (2) a tree-structured framework that jointly assesses vision, language, and action tasks; and (3) an evaluation protocol aligned with human preferences that yields robust, consistent results. Experiments show that bimodal (vision + language) input is essential for accurate action prediction: accuracy drops by 3.3% without vision input, by 4.1% without language input, and by 8.0% without either. DriveAction enables precise identification of model bottlenecks and consistent cross-model evaluation.
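The modality-ablation comparison above amounts to re-scoring action-prediction accuracy while withholding vision or language guidance. A minimal sketch of that bookkeeping is below; the function and field names (`predict`, `action_label`, etc.) are illustrative assumptions, not part of the DriveAction release.

```python
# Hypothetical sketch of the modality-ablation comparison: action-prediction
# accuracy with full input, then with vision or language guidance removed.
# All names here are illustrative assumptions, not the official benchmark API.
from typing import Callable, Optional


def action_accuracy(
    qa_pairs: list,
    predict: Callable[[Optional[bytes], Optional[str]], str],
    use_vision: bool = True,
    use_language: bool = True,
) -> float:
    """Fraction of QA pairs whose discrete action label is predicted correctly."""
    correct = total = 0
    for qa in qa_pairs:
        image = qa["image"] if use_vision else None        # camera frame(s)
        prompt = qa["question"] if use_language else None  # language guidance
        if predict(image, prompt) == qa["action_label"]:   # e.g. "turn_left"
            correct += 1
        total += 1
    return correct / max(total, 1)


def ablation_report(qa_pairs: list, predict) -> dict:
    """Accuracy drop (in percentage points) when a modality is withheld."""
    full = action_accuracy(qa_pairs, predict)
    return {
        "full": full,
        "drop_no_vision": (full - action_accuracy(qa_pairs, predict, use_vision=False)) * 100,
        "drop_no_language": (full - action_accuracy(qa_pairs, predict, use_language=False)) * 100,
        "drop_neither": (full - action_accuracy(
            qa_pairs, predict, use_vision=False, use_language=False)) * 100,
    }
```

Under this reading, the paper's reported numbers correspond to `drop_no_vision` ≈ 3.3, `drop_no_language` ≈ 4.1, and `drop_neither` ≈ 8.0 averaged over the evaluated VLMs.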

📝 Abstract
Vision-Language-Action (VLA) models have advanced autonomous driving, but existing benchmarks still lack scenario diversity, reliable action-level annotation, and evaluation protocols aligned with human preferences. To address these limitations, we introduce DriveAction, the first action-driven benchmark specifically designed for VLA models, comprising 16,185 QA pairs generated from 2,610 driving scenarios. DriveAction leverages real-world driving data proactively collected by users of production-level autonomous vehicles to ensure broad and representative scenario coverage, offers high-level discrete action labels collected directly from users' actual driving operations, and implements an action-rooted tree-structured evaluation framework that explicitly links vision, language, and action tasks, supporting both comprehensive and task-specific assessment. Our experiments demonstrate that state-of-the-art vision-language models (VLMs) require both vision and language guidance for accurate action prediction: on average, accuracy drops by 3.3% without vision input, by 4.1% without language input, and by 8.0% without either. Our evaluation supports precise identification of model bottlenecks with robust and consistent results, thus providing new insights and a rigorous foundation for advancing human-like decisions in autonomous driving.
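The "action-rooted tree-structured evaluation framework" in the abstract can be pictured as an action question at the root with supporting vision and language questions as children, so that scoring can be task-specific (one node) or comprehensive (the whole tree). The sketch below is a rough model under that assumption; node names and the scoring rule are not the official DriveAction protocol.

```python
# Rough model of an action-rooted, tree-structured evaluation: the action task
# is the root, and vision/language sub-tasks hang off it. Names and scoring
# are illustrative assumptions only.
from dataclasses import dataclass, field


@dataclass
class TaskNode:
    name: str                                       # e.g. "action", "vision", "language"
    questions: list = field(default_factory=list)   # QA pairs attached to this task
    children: list = field(default_factory=list)    # supporting sub-tasks

    def accuracy(self, predict) -> float:
        """Task-specific assessment: score only this node's questions."""
        if not self.questions:
            return 0.0
        hits = sum(predict(q) == q["answer"] for q in self.questions)
        return hits / len(self.questions)

    def comprehensive(self, predict) -> dict:
        """Comprehensive assessment: walk the tree rooted at the action task."""
        report = {self.name: self.accuracy(predict)}
        for child in self.children:
            report.update(child.comprehensive(predict))
        return report


# The action task sits at the root; vision and language tasks feed into it.
benchmark = TaskNode("action", children=[TaskNode("vision"), TaskNode("language")])
```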
Problem

Research questions and friction points this paper is trying to address.

Lack of diverse driving scenarios in VLA benchmarks
Absence of reliable action-level annotations
Evaluation protocols misaligned with human preferences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Action-driven benchmark for VLA models
Real-world driving data for scenario coverage
Action-rooted tree-structured evaluation framework
👥 Authors
Yuhan Hao (Li Auto Inc.)
Zhengning Li (Li Auto Inc.)
Lei Sun (Li Auto Inc.)
Weilong Wang (PhD Student, Information System, Purdue University)
Naixin Yi (Li Auto Inc.)
Sheng Song (Li Auto Inc.)
Caihong Qin (Li Auto Inc.)
Mofan Zhou (Li Auto Inc.)
Yifei Zhan (Li Auto Inc.)
Peng Jia (Li Auto Inc.)
Xianpeng Lang (Li Auto Inc.)