🤖 AI Summary
Risk perception in safety-critical autonomous driving scenarios has not been systematically evaluated for vision-language models (VLMs); in particular, no comprehensive benchmark jointly addresses external environmental hazards and in-cabin human behaviors.
Method: We introduce DSBench—the first dedicated benchmark for autonomous driving safety risk perception—covering 10 broad risk categories and 28 fine-grained subcategories, with a high-quality annotated dataset of 98K samples. It establishes the first unified dual-domain (external + in-cabin) risk modeling framework and a multi-dimensional, fine-grained, human-in-the-loop evaluation protocol.
Contribution/Results: Zero-shot and fine-tuned evaluations of leading open- and closed-source VLMs reveal substantial performance degradation in complex risk scenarios. Fine-tuning on DSBench significantly improves safety-aware recognition, providing both critical infrastructure and empirical evidence to advance safety-oriented VLM development.
📝 Abstract
Vision-Language Models (VLMs) show great promise for autonomous driving, but their suitability for safety-critical scenarios remains largely unexplored, raising safety concerns. This gap stems from the lack of comprehensive benchmarks that simultaneously assess both external environmental risks and in-cabin driving behavior safety. To bridge it, we introduce DSBench, the first comprehensive Driving Safety Benchmark designed to assess a VLM's awareness of various safety risks in a unified manner. DSBench encompasses two major categories, external environmental risks and in-cabin driving behavior safety, organized into 10 key categories and a total of 28 sub-categories. This evaluation covers a wide range of scenarios, ensuring a thorough assessment of VLMs' performance in safety-critical contexts. Extensive evaluations across mainstream open-source and closed-source VLMs reveal significant performance degradation under complex safety-critical situations, highlighting urgent safety concerns. To address this, we constructed a large dataset of 98K instances focused on in-cabin and external safety scenarios, and show that fine-tuning on this dataset significantly enhances the safety performance of existing VLMs, paving the way for advancing autonomous driving technology. The benchmark toolkit, code, and model checkpoints will be publicly accessible.