Enhancing Multimodal Large Language Models for Safety-Critical Driving Video Analysis

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Multimodal large language models struggle to accurately perceive rare, high-risk dynamic events, such as collisions and near-collisions, in safety-critical driving scenarios. To address this, the study fuses downsampled video frames with synchronized high-frequency telematics data (IMU and GPS) and semantic output from specialized computer vision models, and uses the fused signal to generate pseudo-labels (descriptive captions and question-answer pairs) targeted at safety-critical events. Fine-tuning the open-source QwenVL-2.5 model with DoRA adapters on this data improves both the identification and the explanation of safety-critical events in driving videos, with fewer than 50 million trainable parameters and a limited computational budget.

📝 Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in general visual understanding. However, their application to safety-critical driving scenarios remains limited by an inability to accurately perceive and reason about rare high-stakes dynamic events, such as collisions or near-collisions. To address this, we introduce a pipeline that enhances MLLM perception by fusing downsampled video frames with synchronized high-frequency telematics data (IMU and GPS) and semantic insights from specialized computer vision models. Our pipeline generates high-quality pseudo-labels, including descriptive captions and question-answer pairs, specifically designed to train MLLMs to identify and describe Safety-Critical Events (SCEs) in real-world driving footage. We show the effectiveness of our approach by fine-tuning the open-source QwenVL-2.5 model via DoRA adapters: our experiments demonstrate significant improvements in identifying and explaining safety-critical events, with fewer than 50M trainable parameters and a limited computational budget.
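
The abstract does not spell out how frames and telematics are combined, so the following is only a minimal sketch of one way the idea could work: each downsampled frame is matched to the high-frequency IMU samples in a short window around it, and a question-answer pseudo-label is derived from the result. Everything here is an assumption made for illustration, not the paper's method: the `ImuSample` type, the function names, the 0.5 s window, the -4.0 m/s^2 harsh-braking threshold, and the wording of the question and answer are all hypothetical.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass


@dataclass
class ImuSample:
    t: float        # timestamp in seconds
    accel_x: float  # longitudinal acceleration in m/s^2 (negative = braking)


def window_min_accel(imu, times, t_center, half_window):
    """Lowest longitudinal acceleration within +/- half_window of t_center.

    `times` is the sorted list of IMU timestamps; returns None if the
    window contains no samples.
    """
    lo = bisect_left(times, t_center - half_window)
    hi = bisect_right(times, t_center + half_window)
    if lo == hi:
        return None
    return min(s.accel_x for s in imu[lo:hi])


def make_qa_pseudo_label(frame_times, imu, half_window=0.5, hard_brake=-4.0):
    """Build one QA pseudo-label for a clip (hypothetical rule and wording)."""
    imu = sorted(imu, key=lambda s: s.t)
    times = [s.t for s in imu]

    # Flag every frame whose surrounding IMU window shows harsh braking.
    flagged = []
    for ft in frame_times:
        a = window_min_accel(imu, times, ft, half_window)
        if a is not None and a <= hard_brake:
            flagged.append((ft, a))

    question = "Does this clip contain a safety-critical event?"
    if flagged:
        peak_t, peak_a = min(flagged, key=lambda pair: pair[1])
        answer = (
            f"Yes. Harsh braking of {peak_a:.1f} m/s^2 "
            f"occurs around t={peak_t:.1f}s."
        )
    else:
        answer = "No. The telematics show no harsh braking."

    return {
        "question": question,
        "answer": answer,
        "event_frames": [ft for ft, _ in flagged],
    }
```

With a 10 Hz IMU trace and frames sampled once per second, a call such as `make_qa_pseudo_label([0.0, 1.0, 2.0, 3.0], imu)` returns a dictionary that could be stored alongside the clip as a training example. A real pipeline would presumably also draw on GPS speed and the vision models' output, which this sketch leaves out.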

Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models

Safety-Critical Events

Driving Video Analysis

Dynamic Event Perception

Rare High-Stakes Scenarios

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Large Language Models

Safety-Critical Events

Telematics Fusion