DriveAction: A Benchmark for Exploring Human-like Driving Decisions in VLA Models

📅 2025-06-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing benchmarks for Vision-Language-Action (VLA) models in autonomous driving face critical limitations: insufficient scenario diversity, unreliable action annotations, and misalignment between automated metrics and human preferences. To address these, the authors introduce DriveAction, the first action-driven benchmark designed for VLA models, comprising 16,185 QA pairs across 2,610 realistic driving scenarios, with high-level discrete action labels derived from the actual driving operations of production-vehicle users. Contributions include: (1) an action-rooted evaluation paradigm; (2) a tree-structured framework that jointly assesses vision, language, and action tasks; and (3) an evaluation protocol aligned with human preferences that yields robust, consistent results. Experiments show that bimodal (vision + language) input is essential for accurate action prediction: accuracy drops by 3.3% without vision input, by 4.1% without language input, and by 8.0% without either. DriveAction enables precise identification of model bottlenecks and consistent cross-model evaluation.
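The modality-ablation comparison above amounts to re-scoring action-prediction accuracy while withholding vision or language guidance. A minimal sketch of that bookkeeping is below; the function and field names (`predict`, `action_label`, etc.) are illustrative assumptions, not part of the DriveAction release.

```python
# Hypothetical sketch of the modality-ablation comparison: action-prediction
# accuracy with full input, then with vision or language guidance removed.
# All names here are illustrative assumptions, not the official benchmark API.
from typing import Callable, Optional


def action_accuracy(
    qa_pairs: list,
    predict: Callable[[Optional[bytes], Optional[str]], str],
    use_vision: bool = True,
    use_language: bool = True,
) -> float:
    """Fraction of QA pairs whose discrete action label is predicted correctly."""
    correct = total = 0
    for qa in qa_pairs:
        image = qa["image"] if use_vision else None        # camera frame(s)
        prompt = qa["question"] if use_language else None  # language guidance
        if predict(image, prompt) == qa["action_label"]:   # e.g. "turn_left"
            correct += 1
        total += 1
    return correct / max(total, 1)


def ablation_report(qa_pairs: list, predict) -> dict:
    """Accuracy drop (in percentage points) when a modality is withheld."""
    full = action_accuracy(qa_pairs, predict)
    return {
        "full": full,
        "drop_no_vision": (full - action_accuracy(qa_pairs, predict, use_vision=False)) * 100,
        "drop_no_language": (full - action_accuracy(qa_pairs, predict, use_language=False)) * 100,
        "drop_neither": (full - action_accuracy(
            qa_pairs, predict, use_vision=False, use_language=False)) * 100,
    }
```

Under this reading, the paper's reported numbers correspond to `drop_no_vision` ≈ 3.3, `drop_no_language` ≈ 4.1, and `drop_neither` ≈ 8.0 averaged over the evaluated VLMs.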

📝 Abstract
Vision-Language-Action (VLA) models have advanced autonomous driving, but existing benchmarks still lack scenario diversity, reliable action-level annotation, and evaluation protocols aligned with human preferences. To address these limitations, we introduce DriveAction, the first action-driven benchmark specifically designed for VLA models, comprising 16,185 QA pairs generated from 2,610 driving scenarios. DriveAction leverages real-world driving data proactively collected by users of production-level autonomous vehicles to ensure broad and representative scenario coverage, offers high-level discrete action labels collected directly from users' actual driving operations, and implements an action-rooted tree-structured evaluation framework that explicitly links vision, language, and action tasks, supporting both comprehensive and task-specific assessment. Our experiments demonstrate that state-of-the-art vision-language models (VLMs) require both vision and language guidance for accurate action prediction: on average, accuracy drops by 3.3% without vision input, by 4.1% without language input, and by 8.0% without either. Our evaluation supports precise identification of model bottlenecks with robust and consistent results, thus providing new insights and a rigorous foundation for advancing human-like decisions in autonomous driving.
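The "action-rooted tree-structured evaluation framework" in the abstract can be pictured as an action question at the root with supporting vision and language questions as children, so that scoring can be task-specific (one node) or comprehensive (the whole tree). The sketch below is a rough model under that assumption; node names and the scoring rule are not the official DriveAction protocol.

```python
# Rough model of an action-rooted, tree-structured evaluation: the action task
# is the root, and vision/language sub-tasks hang off it. Names and scoring
# are illustrative assumptions only.
from dataclasses import dataclass, field


@dataclass
class TaskNode:
    name: str                                       # e.g. "action", "vision", "language"
    questions: list = field(default_factory=list)   # QA pairs attached to this task
    children: list = field(default_factory=list)    # supporting sub-tasks

    def accuracy(self, predict) -> float:
        """Task-specific assessment: score only this node's questions."""
        if not self.questions:
            return 0.0
        hits = sum(predict(q) == q["answer"] for q in self.questions)
        return hits / len(self.questions)

    def comprehensive(self, predict) -> dict:
        """Comprehensive assessment: walk the tree rooted at the action task."""
        report = {self.name: self.accuracy(predict)}
        for child in self.children:
            report.update(child.comprehensive(predict))
        return report


# The action task sits at the root; vision and language tasks feed into it.
benchmark = TaskNode("action", children=[TaskNode("vision"), TaskNode("language")])
```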
Problem

Research questions and friction points this paper is trying to address.

Lack of diverse driving scenarios in VLA benchmarks
Absence of reliable action-level annotations
Evaluation protocols misaligned with human preferences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Action-driven benchmark for VLA models
Real-world driving data for scenario coverage
Action-rooted tree-structured evaluation framework
👥 Authors
Yuhan Hao (Li Auto Inc.)
Zhengning Li (Li Auto Inc.)
Lei Sun (Li Auto Inc.)
Weilong Wang (PhD Student, Information System, Purdue University)
Naixin Yi (Li Auto Inc.)
Sheng Song (Li Auto Inc.)
Caihong Qin (Li Auto Inc.)
Mofan Zhou (Li Auto Inc.)
Yifei Zhan (Li Auto Inc.)
Peng Jia (Li Auto Inc.)
Xianpeng Lang (Li Auto Inc.)