DriveSafe: A Framework for Risk Detection and Safety Suggestions in Driving Scenarios

📅 2026-05-16
📈 Citations: 0
Influential: 0

🤖 AI Summary
Current zero-shot multimodal large language models exhibit limited performance in fine-grained, spatially localized driving risk assessment, falling short of the precise situational awareness required for autonomous driving. This work proposes DriveSafe, a novel framework that introduces explicit linguistic scene representations into risk evaluation. By fine-tuning a multimodal large language model with a lightweight adapter, DriveSafe generates structured natural language descriptions that integrate motion, spatial, and depth cues, enabling accurate identification of hazardous objects and their unsafe behaviors, along with actionable safety recommendations. Evaluated on the DRAMA benchmark, DriveSafe significantly outperforms both zero-shot baselines and existing domain-specific methods, with ablation studies confirming the efficacy of its core design components.
📝 Abstract
Comprehensive situational awareness is essential for autonomous vehicles operating in safety-critical environments, as it enables the identification and mitigation of potential risks. Although recent Multimodal Large Language Models (MLLMs) have shown promise on general vision-language tasks, our findings indicate that zero-shot MLLMs still underperform compared to domain-specific methods in fine-grained, spatially grounded risk assessment. To address this gap, we propose DriveSafe, a framework for risk-aware scene understanding that leverages structured natural language descriptions. Specifically, our method first generates spatially grounded captions enriched with multimodal context, including motion, spatial, and depth cues. These captions are then used for downstream risk assessment, explicitly identifying hazardous objects, their locations, and the unsafe behaviors they imply, followed by actionable safety suggestions. To further improve performance, we employ caption-risk pairings to fine-tune a lightweight adapter module, efficiently injecting domain-specific knowledge into the base LLM. By conditioning risk assessment on explicit language-based scene representations, DriveSafe achieves significant gains over both zero-shot MLLMs and prior domain-specific baselines. Exhaustive experiments on the DRAMA benchmark demonstrate state-of-the-art performance, while ablation studies validate the effectiveness of our key design choices. Project page: https://cvit.iiit.ac.in/research/projects/cvit-projects/drivesafe
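The abstract describes two steps that a small example may make more concrete: turning per-object motion, spatial, and depth cues into a structured caption, and pairing that caption with a risk annotation for adapter fine-tuning. The sketch below is only an illustration under assumptions. The names `ObjectCue`, `horizontal_region`, `build_caption`, and `build_training_pair`, the caption template, and the prompt wording are all hypothetical and are not taken from DriveSafe's released code or data format.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ObjectCue:
    """Hypothetical per-object cues; the field names are assumptions."""
    label: str                       # e.g. "pedestrian"
    box: Tuple[int, int, int, int]   # (x1, y1, x2, y2) in pixels
    motion: str                      # e.g. "crossing left to right"
    depth_m: float                   # estimated distance from the ego vehicle, in metres


def horizontal_region(box: Tuple[int, int, int, int], image_width: int) -> str:
    """Map a box centre to a coarse 'left' / 'center' / 'right' region."""
    cx = (box[0] + box[2]) / 2
    if cx < image_width / 3:
        return "left"
    if cx < 2 * image_width / 3:
        return "center"
    return "right"


def build_caption(cues: List[ObjectCue], image_width: int) -> str:
    """Join per-object cues into one structured, spatially grounded caption."""
    parts = []
    for c in cues:
        region = horizontal_region(c.box, image_width)
        parts.append(
            f"{c.label} at {region} (box {list(c.box)}), "
            f"{c.motion}, about {c.depth_m:.0f} m ahead"
        )
    return "; ".join(parts) + "."


def build_training_pair(caption: str, risk_annotation: str) -> Dict[str, str]:
    """Pair a caption with a risk annotation as a prompt/target record
    (an assumed format for caption-risk fine-tuning data)."""
    prompt = (
        "Scene: " + caption + "\n"
        "Identify the hazardous object, its location, the unsafe behavior, "
        "and a safety suggestion."
    )
    return {"prompt": prompt, "target": risk_annotation}


if __name__ == "__main__":
    cues = [ObjectCue("pedestrian", (600, 300, 680, 520), "crossing left to right", 12.0)]
    caption = build_caption(cues, image_width=1280)
    print(caption)
    # prints: pedestrian at center (box [600, 300, 680, 520]), crossing left to right, about 12 m ahead.
```

Records of this shape could then be fed to any parameter-efficient fine-tuning routine. Which adapter the authors use, and how their captions are actually generated, is not specified in the abstract.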
Problem

Research questions and friction points this paper is trying to address.

risk detection
safety suggestions
autonomous driving
spatially grounded assessment
multimodal perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Large Language Models
Risk Assessment
Spatially Grounded Captions
Domain-Specific Adaptation
Driving Safety