🤖 AI Summary
This work addresses the limited evaluation scope of existing multimodal large language models (MLLMs) for autonomous driving, whose benchmarks predominantly focus on the ego-vehicle perspective and lack systematic assessment of roadside and vehicle-to-everything (V2X) cooperative scenarios. To bridge this gap, the authors introduce V2X-QA, the first real-world multimodal question-answering dataset encompassing vehicle-side, roadside, and collaborative viewpoints, organized into a twelve-task taxonomy spanning perception, prediction, and reasoning and planning. They propose a decoupled viewpoint evaluation protocol and a unified multiple-choice QA framework to enable fine-grained capability diagnosis, and further develop a V2X-MoE baseline model incorporating explicit viewpoint routing and dedicated per-viewpoint LoRA experts. Experiments across ten mainstream models demonstrate that the roadside perspective substantially enhances macroscopic traffic understanding while collaborative reasoning remains challenging, and the strong performance of V2X-MoE validates the efficacy of viewpoint specialization for multimodal cooperative reasoning.
📝 Abstract
Multimodal large language models (MLLMs) have shown strong potential for autonomous driving, yet existing benchmarks remain largely ego-centric and therefore cannot systematically assess model performance in infrastructure-centric and cooperative driving conditions. In this work, we introduce V2X-QA, a real-world dataset and benchmark for evaluating MLLMs across vehicle-side, infrastructure-side, and cooperative viewpoints. V2X-QA is built around a view-decoupled evaluation protocol that enables controlled comparison under vehicle-only, infrastructure-only, and cooperative driving conditions within a unified multiple-choice question answering (MCQA) framework. The benchmark is organized into a twelve-task taxonomy spanning perception, prediction, and reasoning and planning, and is constructed through expert-verified MCQA annotation to enable fine-grained diagnosis of viewpoint-dependent capabilities. Benchmark results across ten representative state-of-the-art proprietary and open-source models show that viewpoint accessibility substantially affects performance and that infrastructure-side reasoning enables meaningful macroscopic traffic understanding. Results also indicate that cooperative reasoning remains challenging, since it requires cross-view alignment and evidence integration rather than simply additional visual input. To address these challenges, we introduce V2X-MoE, a benchmark-aligned baseline with explicit view routing and viewpoint-specific LoRA experts. The strong performance of V2X-MoE further suggests that explicit viewpoint specialization is a promising direction for multi-view reasoning in autonomous driving. Overall, V2X-QA provides a foundation for studying multi-perspective reasoning, reliability, and cooperative physical intelligence in connected autonomous driving. The dataset and V2X-MoE resources are publicly available at: https://github.com/junwei0001/V2X-QA.
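The abstract describes V2X-MoE as combining explicit view routing with viewpoint-specific LoRA experts. The paper's implementation is not shown here, but the general mechanism can be sketched minimally: a frozen shared projection plus one low-rank (A, B) adapter per viewpoint, with a hard router that selects the adapter matching the input's view tag. All names below (`VIEWS`, `route`, `lora_forward`) are illustrative assumptions, not the authors' API.

```python
import numpy as np

# Hypothetical sketch of view-routed LoRA experts (not the paper's code).
VIEWS = ["vehicle", "infrastructure", "cooperative"]

rng = np.random.default_rng(0)
d_in, d_out, rank = 8, 8, 2

# Shared (frozen) base projection, as in standard LoRA fine-tuning.
W_base = rng.normal(size=(d_out, d_in))

# One low-rank adapter per viewpoint: delta_W = B @ A.
# B is zero-initialized, the usual LoRA convention, so each expert
# starts out identical to the base projection.
experts = {
    v: (rng.normal(size=(rank, d_in)) * 0.01,  # A
        np.zeros((d_out, rank)))               # B
    for v in VIEWS
}

def route(view_tag: str):
    """Hard routing: pick the LoRA expert matching the input's view tag."""
    return experts[view_tag]

def lora_forward(x: np.ndarray, view_tag: str) -> np.ndarray:
    """y = W_base @ x + B @ (A @ x), with (A, B) chosen by the router."""
    A, B = route(view_tag)
    return W_base @ x + B @ (A @ x)

x = rng.normal(size=d_in)
y = lora_forward(x, "infrastructure")
# With B zero-initialized, every expert initially equals the base projection.
assert np.allclose(y, W_base @ x)
```

Only the selected expert's low-rank matrices would be trained for each viewpoint, which keeps the experts specialized while the backbone stays shared across views.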