🤖 AI Summary
This work addresses a key gap: general-purpose vision-language models (VLMs) struggle to detect rare, transient, yet safety-critical events in autonomous driving, such as collisions and near-misses. To bridge this gap, the authors propose a modular post-training framework that aligns off-the-shelf VLMs (e.g., Cosmos-Reason1) with the driving domain by integrating metadata captions, large language model–generated descriptions, visual question-answering pairs, and chain-of-thought reasoning supervision. The approach combines multimodal post-training with interpretable reasoning, yielding substantial gains on real-world Nexar dashcam videos: the F1 score for collision detection improves from 0.00 to 0.69, and overall accuracy rises from 35.35% to 77.27%. The method markedly enhances the model's perception, reasoning, and decision traceability for safety-critical events.
📝 Abstract
The rapid growth of ego-centric dashcam footage presents a major challenge for detecting safety-critical events such as collisions and near-collisions, scenarios that are brief, rare, and difficult for generic vision models to capture. While multimodal large language models (MLLMs) demonstrate strong general reasoning ability, they underperform in driving contexts due to domain and temporal misalignment.
We introduce VLM-AutoDrive, a modular post-training framework for adapting pretrained Vision-Language Models (VLMs) to high-fidelity anomaly detection. The framework integrates metadata-derived captions, LLM-generated descriptions, visual question answering (VQA) pairs, and chain-of-thought (CoT) reasoning supervision to enable domain-aligned and interpretable learning. Off-the-shelf VLMs such as NVIDIA's Cosmos-Reason1 7B (CR1) exhibit near-zero Collision recall in zero-shot settings; fine-tuning with VLM-AutoDrive improves Collision F1 from 0.00 to 0.69 and overall accuracy from 35.35% to 77.27%.
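To make the four supervision signals concrete, one can picture each fine-tuning record as bundling a metadata caption, an LLM-generated description, VQA pairs, and a CoT rationale with the final label. The sketch below is a hypothetical schema; the field names, class name, and example values are ours for illustration, not taken from the paper:

```python
from dataclasses import dataclass, asdict, field
import json

@dataclass
class DrivingSFTSample:
    """Hypothetical record combining the four supervision signals."""
    video_id: str
    metadata_caption: str          # derived from dashcam metadata (time, speed, weather)
    llm_description: str           # LLM-generated free-text scene description
    vqa_pairs: list = field(default_factory=list)  # (question, answer) tuples
    cot_rationale: str = ""        # chain-of-thought trace ending in a judgment
    label: str = "Normal"          # Collision / Near-Collision / Normal

sample = DrivingSFTSample(
    video_id="nexar_000123",
    metadata_caption="Night, urban road, ego speed ~45 km/h.",
    llm_description="A sedan ahead brakes sharply as a pedestrian steps off the curb.",
    vqa_pairs=[("Is the ego vehicle closing distance rapidly?", "Yes")],
    cot_rationale="Lead vehicle decelerates; gap shrinks below a safe margin; "
                  "ego brakes late, so the event is a near-collision.",
    label="Near-Collision",
)

# Serialize to the kind of JSON line a fine-tuning pipeline would consume.
print(json.dumps(asdict(sample), indent=2))
```

Keeping the rationale and label in the same record is what lets the fine-tuned model emit an interpretable reasoning trace alongside its prediction.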
VLM-AutoDrive offers a scalable recipe for adapting general-purpose VLMs to safety-critical, temporally localized perception tasks. Evaluated on real-world Nexar dashcam videos, it achieves substantial gains in Collision and Near-Collision detection while producing interpretable reasoning traces, bridging the gap between perception, causality, and decision reasoning in autonomous driving.