Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving

📅 2025-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models (VLMs) for autonomous driving are predominantly evaluated through coarse-grained, open-ended visual question answering, which fails to assess their fine-grained perception and reasoning capabilities in complex, dynamic driving scenarios. To address this gap, the paper proposes VLADBench, a multi-level, closed-form, fine-grained benchmark tailored for autonomous driving. It covers five cognitive domains (Traffic Knowledge Understanding, General Element Recognition, Traffic Graph Generation, Target Attribute Comprehension, and Ego Decision-Making and Planning), subdivided into 11 secondary aspects and 29 tertiary tasks. To probe cross-domain interactions, the authors further train small-scale domain-specific VLMs on 1.4 million domain-specific QA pairs collected from public sources. Experimental results reveal significant limitations of current VLMs in dynamic reasoning, and VLADBench provides a more comprehensive standard for evaluating driving-oriented VLMs toward systems with deeper cognitive and causal reasoning.

📝 Abstract
Existing benchmarks for Vision-Language Models (VLMs) on autonomous driving (AD) primarily assess interpretability through open-form visual question answering (QA) within coarse-grained tasks, which remains insufficient to assess capabilities in complex driving scenarios. To this end, we introduce $\textbf{VLADBench}$, a challenging and fine-grained dataset featuring close-form QAs that progress from static foundational knowledge and elements to advanced reasoning for dynamic on-road situations. $\textbf{VLADBench}$ spans 5 key domains: Traffic Knowledge Understanding, General Element Recognition, Traffic Graph Generation, Target Attribute Comprehension, and Ego Decision-Making and Planning. These domains are further broken down into 11 secondary aspects and 29 tertiary tasks for a granular evaluation. A thorough assessment of general and domain-specific (DS) VLMs on this benchmark reveals both their strengths and critical limitations in AD contexts. To further exploit the cognitive and reasoning interactions among the 5 domains for AD understanding, we start from a small-scale VLM and train DS models on individual domain datasets (collected from 1.4M DS QAs across public sources). The experimental results demonstrate that the proposed benchmark provides a crucial step toward a more comprehensive assessment of VLMs in AD, paving the way for the development of more cognitively sophisticated and reasoning-capable AD systems.
Problem

Research questions and friction points this paper is trying to address.

Evaluate VLMs in autonomous driving scenarios
Assess fine-grained reasoning in dynamic road situations
Develop comprehensive benchmarks for AD-specific VLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces VLADBench for fine-grained VLM evaluation
Trains domain-specific models on 1.4M QAs
Evaluates 5 key domains across 11 secondary aspects and 29 tertiary tasks
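Because the benchmark uses close-form QAs, scores can be computed by exact answer matching rather than free-text judging. A minimal sketch of per-domain accuracy scoring, assuming a hypothetical record schema (the field names "domain", "answer", and "prediction" are illustrative, not the benchmark's actual format):

```python
# Hypothetical sketch of closed-form QA scoring, VLADBench-style:
# group multiple-choice predictions by domain and report accuracy.
from collections import defaultdict

def per_domain_accuracy(records):
    """records: iterable of dicts with 'domain', 'answer', 'prediction' keys."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["domain"]] += 1
        if r["prediction"] == r["answer"]:
            correct[r["domain"]] += 1
    # Exact-match accuracy per domain
    return {d: correct[d] / total[d] for d in total}

demo = [
    {"domain": "Traffic Knowledge Understanding", "answer": "B", "prediction": "B"},
    {"domain": "Traffic Knowledge Understanding", "answer": "A", "prediction": "C"},
    {"domain": "Ego Decision-Making and Planning", "answer": "D", "prediction": "D"},
]
print(per_domain_accuracy(demo))
```

Exact matching of option letters avoids the subjectivity of grading open-form answers, which is the main motivation for the closed-form design.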
👥 Authors
Yue Li — University of Science and Technology of China
Meng Tian — Huawei Noah's Ark Lab
Zhenyu Lin — Huawei Noah's Ark Lab
Jiangtong Zhu — XJTU
Dechang Zhu — Huawei Noah's Ark Lab
Haiqiang Liu — Huawei Noah's Ark Lab
Zining Wang — Beihang University
Yueyi Zhang — Miromind, previously University of Science and Technology of China (structured light, depth sensing, event cameras, medical imaging)
Zhiwei Xiong — University of Science and Technology of China (computational photography, biomedical image analysis)
Xinhai Zhao — Huawei Noah's Ark Lab