🤖 AI Summary
This study systematically evaluates whether vision-language models (VLMs) can generate reliable, visually grounded explanations for autonomous-driving decisions, revealing critical deficiencies: susceptibility to textual priors, poor robustness to visual degradation, and weak multimodal reasoning. To probe these failures, we introduce DriveBench, a driving-specific benchmark covering 17 input settings (clean, corrupted, and text-only), 12 popular VLMs, and over 19,000 real-world driving frames, together with refined evaluation metrics that prioritize visual grounding fidelity and cross-modal consistency. Experiments show that state-of-the-art VLMs frequently produce high-confidence yet erroneous explanations under degraded or missing visual input, that the proposed metrics expose weaknesses concealed by standard scores, and that leveraging models' own awareness of input corruptions can improve their reliability.
📝 Abstract
Recent advancements in Vision-Language Models (VLMs) have sparked interest in their use for autonomous driving, particularly in generating interpretable driving decisions through natural language. However, the assumption that VLMs inherently provide visually grounded, reliable, and interpretable explanations for driving remains largely unexamined. To address this gap, we introduce DriveBench, a benchmark dataset designed to evaluate VLM reliability across 17 settings (clean, corrupted, and text-only inputs), encompassing 19,200 frames, 20,498 question-answer pairs, three question types, and four mainstream driving tasks, on which we evaluate 12 popular VLMs. Our findings reveal that VLMs often generate plausible responses derived from general knowledge or textual cues rather than true visual grounding, especially under degraded or missing visual inputs. This behavior, concealed by dataset imbalances and insufficient evaluation metrics, poses significant risks in safety-critical scenarios such as autonomous driving. We further observe that VLMs struggle with multi-modal reasoning and display heightened sensitivity to input corruptions, leading to inconsistent performance. To address these challenges, we propose refined evaluation metrics that prioritize robust visual grounding and multi-modal understanding. Additionally, we highlight the potential of leveraging VLMs' awareness of corruptions to enhance their reliability, offering a roadmap for developing more trustworthy and interpretable decision-making systems in real-world autonomous driving contexts. The benchmark toolkit is publicly accessible.
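The core evaluation protocol described above, comparing a model's answers across clean, corrupted, and text-only settings to detect answers driven by textual priors rather than pixels, can be sketched as follows. This is a minimal illustration, not the DriveBench toolkit's actual API: the setting names, the toy corruption, and the mock model are all assumptions for demonstration.

```python
import random

def gaussian_noise(image, sigma=25.0):
    """One example visual corruption: add zero-mean noise to pixel values."""
    rng = random.Random(0)  # fixed seed for reproducibility
    return [min(255.0, max(0.0, p + rng.gauss(0, sigma))) for p in image]

def blank_out(image):
    """Text-only setting: drop the visual input entirely."""
    return None

# Three of the benchmark's input settings (hypothetical names).
SETTINGS = {
    "clean": lambda img: img,
    "noise": gaussian_noise,
    "text_only": blank_out,
}

def mock_vlm(image, question):
    # A deliberately brittle stand-in model: it ignores the image and
    # answers from textual priors, i.e. the exact failure mode the
    # benchmark is designed to expose.
    return "the ego vehicle should brake"

def evaluate(frames, question, reference_answer):
    """Score the model on each input setting and return per-setting accuracy."""
    results = {}
    for name, corrupt in SETTINGS.items():
        preds = [mock_vlm(corrupt(img), question) for img in frames]
        results[name] = sum(p == reference_answer for p in preds) / len(preds)
    return results

frames = [[128.0] * 16 for _ in range(4)]  # toy stand-ins for driving frames
scores = evaluate(frames, "What should the ego vehicle do?",
                  "the ego vehicle should brake")
print(scores)
```

A model whose text-only score matches its clean score, as happens here by construction, is answering from priors rather than visual evidence; a genuinely grounded model should degrade when the image is corrupted or removed.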