Investigating Traffic Accident Detection Using Multimodal Large Language Models

📅 2025-09-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the heavy reliance of roadside camera-based accident detection on large-scale annotated data, this paper proposes a zero-shot multimodal reasoning framework. First, the synthetic DeepAccident dataset, generated with the CARLA simulator, is used to mitigate the scarcity of real-world accident samples. Second, structured visual prompts are generated by integrating YOLO (object detection), Deep SORT (multi-object tracking), and SAM (instance segmentation), thereby enhancing the spatiotemporal reasoning capabilities of multimodal large language models (MLLMs). Evaluation on Gemini, Gemma 3, and Pixtral shows that Pixtral achieves an F1-score of 0.71 and 83% recall; prompt-optimized Gemini attains 90% precision; and Gemma 3 exhibits the most stable performance. This work pioneers the synergistic integration of vision foundation models and MLLMs for zero-shot traffic incident understanding, improving accident recognition accuracy, interpretability, and system scalability without requiring task-specific labeled data.

📝 Abstract
Traffic safety remains a critical global concern, with timely and accurate accident detection essential for hazard reduction and rapid emergency response. Infrastructure-based vision sensors offer scalable and efficient solutions for continuous real-time monitoring, facilitating automated detection of accidents directly from captured images. This research investigates the zero-shot capabilities of multimodal large language models (MLLMs) for detecting and describing traffic accidents using images from infrastructure cameras, thus minimizing reliance on extensive labeled datasets. Main contributions include: (1) Evaluation of MLLMs using the simulated DeepAccident dataset from CARLA, explicitly addressing the scarcity of diverse, realistic, infrastructure-based accident data through controlled simulations; (2) Comparative performance analysis between Gemini 1.5 and 2.0, Gemma 3, and Pixtral models in accident identification and descriptive capabilities without prior fine-tuning; and (3) Integration of advanced visual analytics, specifically YOLO for object detection, Deep SORT for multi-object tracking, and Segment Anything (SAM) for instance segmentation, into enhanced prompts to improve model accuracy and explainability. Key numerical results show Pixtral as the top performer with an F1-score of 0.71 and 83% recall, while Gemini models gained precision with enhanced prompts (e.g., Gemini 1.5 rose to 90%) but suffered notable F1 and recall losses. Gemma 3 offered the most balanced performance with minimal metric fluctuation. These findings demonstrate the substantial potential of integrating MLLMs with advanced visual analytics techniques, enhancing their applicability in real-world automated traffic monitoring systems.
Problem

Research questions and friction points this paper is trying to address.

Detecting traffic accidents using multimodal language models from infrastructure camera images
Evaluating zero-shot MLLM capabilities without extensive labeled training datasets
Integrating visual analytics with MLLMs to improve accident detection accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot MLLMs detect accidents without labeled data
Simulated DeepAccident dataset addresses real data scarcity
Enhanced prompts integrate YOLO, Deep SORT, and SAM
Ilhan Skender
Embedded Systems Group (Dept.-E), Virtual Vehicle Research GmbH, Graz, 8010 Austria
Kailin Tong
Control Systems Group (Dept.-E), Virtual Vehicle Research GmbH, Graz, 8010 Austria
Selim Solmaz
Control Systems Group (Dept.-E), Virtual Vehicle Research GmbH, Graz, 8010 Austria
Daniel Watzenig
Graz University of Technology