AutoDriDM: An Explainable Benchmark for Decision-Making of Vision-Language Models in Autonomous Driving

📅 2026-01-21
📈 Citations: 2
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a critical gap in autonomous driving evaluation, which has predominantly focused on perception while neglecting the decision-making capabilities of Vision-Language Models (VLMs). We propose the first decision-centric, progressive benchmark comprising three hierarchical levels — object, scene, and decision — with 6,650 carefully curated questions to systematically probe where VLMs break down in the transition from perception to decision-making, and how interpretable their reasoning is. By establishing an interpretable evaluation framework, we reveal a weak correlation between perception and decision performance and introduce an analyzer model that enables large-scale automatic annotation. Our analysis identifies key failure modes of prevailing VLMs on autonomous driving decision tasks, offering both a rigorous evaluation foundation and actionable directions for developing safer, more reliable models.

📝 Abstract
Autonomous driving is a highly challenging domain that requires reliable perception and safe decision-making in complex scenarios. Recent vision-language models (VLMs) demonstrate reasoning and generalization abilities, opening new possibilities for autonomous driving; however, existing benchmarks and metrics overemphasize perceptual competence and fail to adequately assess decision-making processes. In this work, we present AutoDriDM, a decision-centric, progressive benchmark with 6,650 questions across three dimensions: Object, Scene, and Decision. We evaluate mainstream VLMs to delineate the perception-to-decision capability boundary in autonomous driving, and our correlation analysis reveals weak alignment between perception and decision-making performance. We further conduct explainability analyses of models' reasoning processes, identifying key failure modes such as logical reasoning errors, and introduce an analyzer model to automate large-scale annotation. AutoDriDM bridges the gap between perception-centered and decision-centered evaluation, providing guidance toward safer and more reliable VLMs for real-world autonomous driving.
Problem

Research questions and friction points this paper is trying to address.

autonomous driving
vision-language models
decision-making
benchmark
perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language models
autonomous driving
decision-making benchmark
explainability
perception-decision gap
Zecong Tang
Zhejiang University, Hangzhou, China
Zixu Wang
Technical University of Munich & Infineon Technologies AG
Deep Learning, LLM, Software Engineering, Autonomous Driving
Yifei Wang
Zhejiang University, Hangzhou, China
Weitong Lian
Zhejiang University, Hangzhou, China
Tianjian Gao
Zhejiang University, Hangzhou, China
Haoran Li
University of Science and Technology of China
3D Generation, 3D Editing, 3D Understanding
Tengju Ru
Zhejiang University, Hangzhou, China
Lingyi Meng
Zhejiang University, Hangzhou, China
Zhejun Cui
Zhejiang University, Hangzhou, China
Yichen Zhu
Zhejiang University, Hangzhou, China
Qi Kang
Tongji University
Computational Intelligence, Artificial Intelligence, Machine Learning
Kaixuan Wang
The University of Hong Kong, Hong Kong, China
Yu Zhang
Associate Professor, Zhejiang University
SLAM, 3D Vision, Robotics