🤖 AI Summary
Vision large language models (VLLMs) lack rigorous evaluation frameworks for safety-critical applications such as L3+ autonomous driving. Method: This paper introduces DVBench, the first VLLM benchmark tailored for L3+ autonomous driving, which uses ISO/PAS 21448 (SOTIF) to define a hierarchical capability taxonomy. It comprises 10,000 human-annotated multiple-choice questions derived from complex, dynamic traffic videos and targets fine-grained assessment of perception and reasoning. The paper proposes a safety-driven VLLM evaluation paradigm, releases the first fine-grained driving video annotation benchmark, and supports both LoRA and full-parameter fine-tuning. Contribution/Results: An evaluation of 14 state-of-the-art VLLMs (0.5B–72B parameters) shows that no model exceeds 40% accuracy; domain-adaptive fine-tuning on DVBench data improves accuracy by 5.24–10.94 percentage points (up to 43.59% relative gain). The DVBench toolkit, including annotations, evaluation scripts, and fine-tuned models, is open-sourced to advance safe, trustworthy VLLM development.
📝 Abstract
Vision Large Language Models (VLLMs) have demonstrated impressive capabilities in general visual tasks such as image captioning and visual question answering. However, their effectiveness in specialized, safety-critical domains like autonomous driving remains largely unexplored. Autonomous driving systems require sophisticated scene understanding in complex environments, yet existing multimodal benchmarks primarily focus on normal driving conditions and fail to adequately assess VLLMs' performance in safety-critical scenarios. To address this, we introduce DVBench, a pioneering benchmark designed to evaluate the performance of VLLMs in understanding safety-critical driving videos. Built around a hierarchical ability taxonomy that aligns with widely adopted frameworks for describing driving scenarios used in assessing highly automated driving systems, DVBench features 10,000 multiple-choice questions with human-annotated ground-truth answers, enabling a comprehensive evaluation of VLLMs' capabilities in perception and reasoning. Experiments on 14 state-of-the-art VLLMs, ranging from 0.5B to 72B parameters, reveal significant performance gaps, with no model achieving over 40% accuracy, highlighting critical limitations in understanding complex driving scenarios. To probe adaptability, we fine-tuned selected models using domain-specific data from DVBench, achieving accuracy gains ranging from 5.24 to 10.94 percentage points, with relative improvements of up to 43.59%. This improvement underscores the necessity of targeted adaptation to bridge the gap between general-purpose VLLMs and mission-critical driving applications. DVBench establishes an essential evaluation framework and research roadmap for developing VLLMs that meet the safety and robustness requirements for real-world autonomous systems. We release the benchmark toolbox and the fine-tuned model at: https://github.com/tong-zeng/DVBench.git.
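The abstract reports fine-tuning gains both in percentage points (absolute) and as relative improvements, and the two are easy to confuse. The sketch below shows how the two figures relate for a multiple-choice benchmark. It is a minimal illustration, not the DVBench evaluation code: the function names `accuracy` and `gains`, the eight-question answer key, and both prediction lists are hypothetical and do not come from the benchmark.

```python
def accuracy(predictions, answers):
    """Percentage of multiple-choice questions answered correctly."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("predictions and answers must be non-empty and equal in length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


def gains(before, after):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after - before               # difference of two accuracies
    relative = 100.0 * absolute / before    # the same gain, scaled by the baseline
    return absolute, relative


# Hypothetical answer key and model outputs (not DVBench data).
answers     = ["A", "C", "B", "D", "A", "B", "C", "D"]
base_preds  = ["A", "B", "B", "A", "C", "D", "A", "B"]  # before fine-tuning
tuned_preds = ["A", "C", "B", "A", "C", "D", "A", "B"]  # after fine-tuning

base_acc = accuracy(base_preds, answers)
tuned_acc = accuracy(tuned_preds, answers)
abs_gain, rel_gain = gains(base_acc, tuned_acc)

print(f"baseline {base_acc:.2f}%, fine-tuned {tuned_acc:.2f}%")
print(f"gain: {abs_gain:.2f} percentage points, {rel_gain:.2f}% relative")
```

Because the relative figure divides by the baseline accuracy, a model with a low starting score can show a large relative improvement from a modest gain in percentage points. The largest absolute gain and the largest relative gain reported above therefore need not belong to the same model.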