🤖 AI Summary
Existing vision-language models (VLMs) for autonomous driving are predominantly evaluated through coarse-grained, open-ended visual question answering, which fails to assess their fine-grained perception and reasoning capabilities in complex, dynamic driving scenarios. To address this gap, the authors propose VLADBench, a multi-level, closed-form, fine-grained benchmark tailored for autonomous driving. It covers five cognitive domains — traffic knowledge understanding, general element recognition, traffic graph generation, target attribute comprehension, and ego decision-making and planning — organized into 29 hierarchical tasks. Leveraging 1.4 million domain-specific QA pairs collected from public sources, the work further explores domain-specific supervised fine-tuning and cross-domain collaborative training starting from a small-scale VLM. Experimental results reveal significant limitations of current VLMs in dynamic reasoning. VLADBench establishes a new standard for evaluating driving-oriented VLMs and advances the development of next-generation systems with deeper cognitive modeling and causal reasoning.
📝 Abstract
Existing benchmarks for Vision-Language Models (VLMs) in autonomous driving (AD) primarily assess interpretability through open-form visual question answering (QA) on coarse-grained tasks, which remains insufficient for evaluating capabilities in complex driving scenarios. To this end, we introduce **VLADBench**, a challenging and fine-grained dataset featuring closed-form QAs that progress from static foundational knowledge and elements to advanced reasoning about dynamic on-road situations. **VLADBench** spans 5 key domains: Traffic Knowledge Understanding, General Element Recognition, Traffic Graph Generation, Target Attribute Comprehension, and Ego Decision-Making and Planning. These domains are further broken down into 11 secondary aspects and 29 tertiary tasks for granular evaluation. A thorough assessment of general and domain-specific (DS) VLMs on this benchmark reveals both their strengths and critical limitations in AD contexts. To further exploit the cognitive and reasoning interactions among the 5 domains for AD understanding, we start from a small-scale VLM and train DS models on individual domain datasets (collected from 1.4M DS QAs across public sources). The experimental results demonstrate that the proposed benchmark provides a crucial step toward a more comprehensive assessment of VLMs in AD, paving the way for the development of more cognitively sophisticated and reasoning-capable AD systems.