🤖 AI Summary
Current evaluation frameworks lack standardized benchmarks for assessing the safety-aware cognitive capabilities of vision-language models (VLMs) in autonomous driving—particularly in human–machine interaction scenarios.
Method: We introduce SCD-Bench, the first safety-cognitive evaluation benchmark tailored to autonomous driving. It establishes a safety-cognition-driven evaluation paradigm, leverages the Autonomous Driving Image-Text Annotation System (ADA) for scalable multimodal data curation, and integrates expert annotation with LLM-based automated assessment, achieving 99.74% agreement with expert human evaluation.
Contribution/Results: Experiments reveal that mainstream open-source VLMs exhibit substantially weaker safety cognition than GPT-4o; lightweight models (1B–4B parameters) perform near-chance, exposing critical deployment bottlenecks. SCD-Bench provides both a rigorous evaluation standard and a methodological foundation for trustworthy integration of VLMs into safety-critical autonomous driving systems.
📝 Abstract
Assessing the safety of vision-language models (VLMs) in autonomous driving is particularly important; however, existing work mainly focuses on traditional benchmark evaluations. As interactive components within autonomous driving systems, VLMs must maintain strong safety cognition during interactions. From this perspective, we propose a novel evaluation benchmark: the Safety Cognitive Driving Benchmark (SCD-Bench). To address the large-scale annotation challenge posed by SCD-Bench, we develop the Autonomous Driving Image-Text Annotation System (ADA). Additionally, to ensure data quality, the SCD-Bench dataset undergoes manual refinement by experts with professional knowledge of autonomous driving. We further develop an automated evaluation method based on large language models (LLMs); to verify its effectiveness, we compare its results with expert human evaluations, achieving a consistency rate of 99.74%. Preliminary experimental results indicate that existing open-source models still lack sufficient safety cognition, showing a significant gap compared to GPT-4o. Notably, lightweight models (1B–4B parameters) demonstrate minimal safety cognition. However, since lightweight models are crucial for autonomous driving systems, this poses a significant challenge for integrating VLMs into the field.
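As a rough illustration, the reported consistency rate between the LLM-based evaluator and expert human evaluation can be understood as simple exact-match agreement over per-item judgments. The sketch below is a minimal assumption of that computation; the function name, labels, and data are illustrative and not taken from the paper's actual protocol.

```python
def agreement_rate(llm_labels, expert_labels):
    """Fraction of items on which the LLM judge and the expert agree.

    A consistency rate like 99.74% would correspond to this value
    computed over the full evaluation set (hypothetical sketch).
    """
    if len(llm_labels) != len(expert_labels):
        raise ValueError("label lists must be the same length")
    matches = sum(a == b for a, b in zip(llm_labels, expert_labels))
    return matches / len(llm_labels)


# Illustrative toy data: per-item safety judgments from each evaluator.
llm = ["safe", "unsafe", "safe", "safe"]
expert = ["safe", "unsafe", "safe", "unsafe"]
print(f"{agreement_rate(llm, expert):.2%}")  # → 75.00%
```

In practice such a check is run over thousands of benchmark items; a rate this high (99.74%) is what justifies substituting the LLM judge for expert annotation at scale.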