ScVLM: Enhancing Vision-Language Model for Safety-Critical Event Understanding

📅 2024-10-01

📈 Citations: 0

✨ Influential: 0

career value

236K/year

🤖 AI Summary

Image-text models suffer from biased understanding and hallucinated descriptions of traffic safety events (e.g., collisions, near-misses) due to severe class imbalance and sparse samples for rare, high-risk scenarios. Method: We propose a classification-guided vision-language model (VLM) co-training paradigm that jointly optimizes supervised event classification and contrastive learning in a dual-path framework. A structured classification task—explicitly predicting event type and severity—constrains the language generation process; multimodal fusion is implemented atop the CLIP architecture. Domain adaptation is achieved via fine-tuning on SCE video-text pairs and the SHRP2 naturalistic driving dataset. Results: Evaluated on over 8,600 real-world safety-critical events, our method improves event classification accuracy by 12.7% and reduces language generation hallucination rate by 39.5%, significantly enhancing model sensitivity to rare hazardous events and fidelity in modeling safety-critical visual-semantic features.

Technology Category

Application Category

📝 Abstract

Accurately identifying, understanding and describing traffic safety-critical events (SCEs), including crashes, tire strikes, and near-crashes, is crucial for advanced driver assistance systems, automated driving systems, and traffic safety. As SCEs are rare events, most general vision-language models (VLMs) have not been trained sufficiently to link SCE videos and narratives, which could lead to hallucinations and missing key safety characteristics. Here, we introduce ScVLM, a novel hybrid methodology that integrates supervised and contrastive learning techniques to classify the severity and types of SCEs, as well as to generate narrative descriptions of SCEs. This approach utilizes classification to enhance VLMs' comprehension of driving videos and improve the rationality of event descriptions. The proposed approach is trained on and evaluated by more than 8,600 SCEs from the Second Strategic Highway Research Program Naturalistic Driving Study dataset, the largest publicly accessible driving dataset with videos and SCE annotations. The results demonstrate the superiority of the proposed approach in generating contextually accurate event descriptions and mitigating VLM hallucinations. The code will be available at https://github.com/datadrivenwheels/ScVLM.

Problem

Research questions and friction points this paper is trying to address.

Image Captioning

Traffic Safety Events

Data Bias

Innovation

Methods, ideas, or system contributions that make the work stand out.

ScVLM

rare traffic safety events

learning strategies

🔎 Similar Papers

No similar papers found.