🤖 AI Summary
Zero-shot multimodal large language models perform poorly on fine-grained, spatially localized driving risk assessment, falling short of the precise situational awareness that autonomous driving requires. This work proposes DriveSafe, a framework that conditions risk evaluation on explicit linguistic scene representations. DriveSafe first generates structured natural language descriptions that integrate motion, spatial, and depth cues, then uses them to identify hazardous objects, their locations, and their unsafe behaviors, and to produce actionable safety recommendations. A lightweight adapter, fine-tuned on caption-risk pairs, injects domain-specific knowledge into the base LLM. On the DRAMA benchmark, DriveSafe outperforms both zero-shot baselines and existing domain-specific methods, and ablation studies support its core design components.
📝 Abstract
Comprehensive situational awareness is essential for autonomous vehicles operating in safety-critical environments, as it enables the identification and mitigation of potential risks. Although recent Multimodal Large Language Models (MLLMs) have shown promise on general vision-language tasks, our findings indicate that zero-shot MLLMs still underperform domain-specific methods in fine-grained, spatially grounded risk assessment. To address this gap, we propose DriveSafe, a framework for risk-aware scene understanding that leverages structured natural language descriptions. Specifically, our method first generates spatially grounded captions enriched with multimodal context, including motion, spatial, and depth cues. These captions are then used for downstream risk assessment, which explicitly identifies hazardous objects, their locations, and the unsafe behaviors they imply, followed by actionable safety suggestions. To further improve performance, we use caption-risk pairings to fine-tune a lightweight adapter module, efficiently injecting domain-specific knowledge into the base LLM. By conditioning risk assessment on explicit language-based scene representations, DriveSafe achieves significant gains over both zero-shot MLLMs and prior domain-specific baselines. Extensive experiments on the DRAMA benchmark demonstrate state-of-the-art performance, while ablation studies validate the effectiveness of our key design choices. Project page: https://cvit.iiit.ac.in/research/projects/cvit-projects/drivesafe
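To make the caption-then-assess idea concrete, the following is a minimal illustrative sketch of how motion, spatial, and depth cues might be folded into a structured caption and passed on as a risk-assessment prompt. It is not the authors' implementation: the `SceneObject` fields, the `describe` and `build_prompt` helpers, the left/center/right thirds heuristic, and the prompt wording are all assumptions made for illustration, and the actual MLLM call is omitted.

```python
from dataclasses import dataclass


@dataclass
class SceneObject:
    """Hypothetical per-object record; field names are illustrative."""
    label: str      # e.g. "pedestrian"
    bbox: tuple     # (x_min, y_min, x_max, y_max) in pixels
    motion: str     # e.g. "crossing left to right"
    depth_m: float  # estimated distance from the ego vehicle, in meters


def describe(obj: SceneObject, image_width: int) -> str:
    """Render one object as a spatially grounded caption sentence."""
    x_center = (obj.bbox[0] + obj.bbox[2]) / 2
    # Assumed heuristic: split the frame into three vertical bands.
    if x_center < image_width / 3:
        side = "left"
    elif x_center < 2 * image_width / 3:
        side = "center"
    else:
        side = "right"
    return (f"A {obj.label} at {obj.bbox} in the {side} third of the frame, "
            f"{obj.motion}, about {obj.depth_m:.0f} m ahead.")


def build_prompt(objects, image_width: int) -> str:
    """Join per-object sentences into a caption and append the risk query."""
    caption = " ".join(describe(o, image_width) for o in objects)
    return ("Scene: " + caption + "\n"
            "Identify the most hazardous object, its location, the unsafe "
            "behavior it implies, and a safety suggestion.")


if __name__ == "__main__":
    ped = SceneObject("pedestrian", (600, 300, 680, 500),
                      "crossing left to right", 12.0)
    print(build_prompt([ped], image_width=1280))
```

In this sketch the prompt string would be handed to a language model for the downstream risk assessment; keeping the scene representation as explicit text is what allows caption-risk pairs to be collected for adapter fine-tuning.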