DriveXQA: Cross-modal Visual Question Answering for Adverse Driving Scene Understanding

📅 2026-03-11

📈 Citations: 0

✨ Influential: 0

career value

229K/year

🤖 AI Summary

Existing approaches struggle to effectively integrate multimodal sensor data for understanding anomalous scenarios in autonomous driving. To address this challenge, this work introduces DriveXQA, the first dataset specifically designed for such settings, encompassing four visual modalities, diverse sensor failures, and adverse weather conditions. Furthermore, we propose MVX-LLM, a multimodal large language model (MLLM) framework that enables question answering over multiple visual inputs through a token-efficient Dual Cross-Attention projector for cross-modal fusion. Experimental results demonstrate that our method significantly outperforms baseline approaches in challenging conditions such as fog, achieving a GPTScore of 53.5 compared to 25.1. Both the code and the DriveXQA dataset are publicly released.

Technology Category

Application Category

📝 Abstract

Fusing sensors with complementary modalities is crucial for maintaining a stable and comprehensive understanding of abnormal driving scenes. However, Multimodal Large Language Models (MLLMs) are underexplored for leveraging multi-sensor information to understand adverse driving scenarios in autonomous vehicles. To address this gap, we propose the DriveXQA, a multimodal dataset for autonomous driving VQA. In addition to four visual modalities, five sensor failure cases, and five weather conditions, it includes $102,505$ QA pairs categorized into three types: global scene level, allocentric level, and ego-vehicle centric level. Since no existing MLLM framework adopts multiple complementary visual modalities as input, we design MVX-LLM, a token-efficient architecture with a Dual Cross-Attention (DCA) projector that fuses the modalities to alleviate information redundancy. Experiments demonstrate that our DCA achieves improved performance under challenging conditions such as foggy (GPTScore: $53.5$ vs. $25.1$ for the baseline). The established dataset and source code will be made publicly available.

Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models

Visual Question Answering

Adverse Driving Scenes

Sensor Fusion

Autonomous Driving

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Large Language Models

Visual Question Answering

Dual Cross-Attention