🤖 AI Summary
Current evaluation frameworks lack standardized benchmarks for assessing the safety-aware cognitive capabilities of vision-language models (VLMs) in autonomous driving—particularly in human–machine interaction scenarios.
Method: We introduce SCD-Bench, the first safety-cognitive evaluation benchmark tailored to autonomous driving. It establishes a safety-cognition-driven evaluation paradigm, leverages the Autonomous Driving Image-Text Annotation System (ADA) for scalable multimodal data curation, and integrates expert annotation with LLM-based automated assessment, achieving 99.74% agreement with expert human evaluation.
Contribution/Results: Experiments reveal that mainstream open-source VLMs exhibit substantially weaker safety cognition than GPT-4o; lightweight models (1B–4B parameters) perform near-chance, exposing critical deployment bottlenecks. SCD-Bench provides both a rigorous evaluation standard and a methodological foundation for trustworthy integration of VLMs into safety-critical autonomous driving systems.
📝 Abstract
Assessing the safety of vision-language models (VLMs) in autonomous driving is particularly important; however, existing work mainly focuses on traditional benchmark evaluations. As interactive components within autonomous driving systems, VLMs must maintain strong safety cognition during interactions. From this perspective, we propose a novel evaluation benchmark: the Safety Cognitive Driving Benchmark (SCD-Bench). To address the large-scale annotation challenge posed by SCD-Bench, we develop the Autonomous Driving Image-Text Annotation System (ADA). Additionally, to ensure data quality, the SCD-Bench dataset undergoes manual refinement by experts with professional knowledge of autonomous driving. We further develop an automated evaluation method based on large language models (LLMs); to verify its effectiveness, we compare its results with expert human evaluations, achieving a consistency rate of 99.74%. Preliminary experimental results indicate that existing open-source models still lack sufficient safety cognition, showing a significant gap compared to GPT-4o. Notably, lightweight models (1B–4B parameters) demonstrate minimal safety cognition. However, since lightweight models are crucial for autonomous driving systems, this poses a significant challenge for integrating VLMs into the field.
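As a rough illustration, the reported consistency rate between the LLM-based evaluator and expert human evaluation can be understood as simple exact-match agreement over per-item judgments. The sketch below is a minimal assumption of that computation; the function name, labels, and data are illustrative and not taken from the paper's actual protocol.

```python
def agreement_rate(llm_labels, expert_labels):
    """Fraction of items on which the LLM judge and the expert agree.

    A consistency rate like 99.74% would correspond to this value
    computed over the full evaluation set (hypothetical sketch).
    """
    if len(llm_labels) != len(expert_labels):
        raise ValueError("label lists must be the same length")
    matches = sum(a == b for a, b in zip(llm_labels, expert_labels))
    return matches / len(llm_labels)


# Illustrative toy data: per-item safety judgments from each evaluator.
llm = ["safe", "unsafe", "safe", "safe"]
expert = ["safe", "unsafe", "safe", "unsafe"]
print(f"{agreement_rate(llm, expert):.2%}")  # → 75.00%
```

In practice such a check is run over thousands of benchmark items; a rate this high (99.74%) is what justifies substituting the LLM judge for expert annotation at scale.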