NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models

📅 2025-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multi-modal large language models (MLLMs) show notable limitations in driving scene understanding, particularly in fusing multi-view information and reasoning spatially from an ego-centric perspective. To address this, the authors propose BEV-LLM, an architecture that integrates bird's-eye-view (BEV) features derived from multi-view images into an MLLM. They also introduce NuPlanQA-Eval, a multi-view visual question answering (VQA) benchmark for driving scene understanding, and NuPlanQA-1M, a large-scale dataset of 1M real-world VQA pairs organized into nine subtasks across three core skills: Road Environment Perception, Spatial Relations Recognition, and Ego-Centric Reasoning. Evaluation on this benchmark exposes key weaknesses of existing MLLMs in driving-scene perception and ego-centric spatial reasoning, whereas BEV-LLM outperforms the other evaluated models on six of the nine subtasks. NuPlanQA is publicly released as an evaluation platform for multi-modal understanding in autonomous driving.

📝 Abstract
Recent advances in multi-modal large language models (MLLMs) have demonstrated strong performance across various domains; however, their ability to comprehend driving scenes remains less proven. The complexity of driving scenarios, which includes multi-view information, poses significant challenges for existing MLLMs. In this paper, we introduce NuPlanQA-Eval, a multi-view, multi-modal evaluation benchmark for driving scene understanding. To further support generalization to multi-view driving scenarios, we also propose NuPlanQA-1M, a large-scale dataset comprising 1M real-world visual question-answering (VQA) pairs. For context-aware analysis of traffic scenes, we categorize our dataset into nine subtasks across three core skills: Road Environment Perception, Spatial Relations Recognition, and Ego-Centric Reasoning. Furthermore, we present BEV-LLM, integrating Bird's-Eye-View (BEV) features from multi-view images into MLLMs. Our evaluation results reveal key challenges that existing MLLMs face in driving scene-specific perception and spatial reasoning from ego-centric perspectives. In contrast, BEV-LLM demonstrates remarkable adaptability to this domain, outperforming other models in six of the nine subtasks. These findings highlight how BEV integration enhances multi-view MLLMs while also identifying key areas that require further refinement for effective adaptation to driving scenes. To facilitate further research, we publicly release NuPlanQA at https://github.com/sungyeonparkk/NuPlanQA.
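
To make the dataset structure described above concrete, here is a minimal sketch of what a single NuPlanQA-1M entry could look like. All field names, camera-view names, and the subtask label are illustrative assumptions based on the abstract (multi-view images paired with a VQA pair and a skill/subtask category), not the released schema.

```python
from dataclasses import dataclass

@dataclass
class NuPlanQASample:
    """One multi-view VQA pair (illustrative schema only)."""
    scene_id: str                 # identifier of the source driving scene
    camera_views: dict[str, str]  # view name -> image path
    question: str
    answer: str
    skill: str                    # one of the three core skills
    subtask: str                  # one of the nine fine-grained subtasks

sample = NuPlanQASample(
    scene_id="scene_0001",        # hypothetical identifier
    camera_views={
        "CAM_FRONT": "images/scene_0001/front.jpg",
        "CAM_FRONT_LEFT": "images/scene_0001/front_left.jpg",
        # remaining surround-view cameras omitted for brevity
    },
    question="Which vehicle is closest to the ego car on its left side?",
    answer="The white SUV in the adjacent left lane.",
    skill="Spatial Relations Recognition",
    subtask="relative object localization",  # hypothetical subtask name
)
print(sample.skill)
```

The actual field layout is defined in the released dataset at the repository linked above.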
Problem

Research questions and friction points this paper is trying to address.

Existing MLLMs struggle to comprehend multi-view driving scenes, especially ego-centric spatial reasoning.
Driving-scene VQA lacks a large-scale, real-world dataset and a multi-view evaluation benchmark.
Standard MLLM architectures do not exploit Bird's-Eye-View (BEV) features available from multi-view images.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces NuPlanQA-Eval, a multi-view evaluation benchmark for driving scene understanding
Proposes NuPlanQA-1M, a large-scale dataset of 1M real-world VQA pairs
Develops BEV-LLM, which integrates Bird's-Eye-View (BEV) features from multi-view images into MLLMs (see the sketch below)
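
The abstract states only that BEV-LLM integrates BEV features from multi-view images into an MLLM. As one plausible reading of that design, the sketch below pools a BEV feature map into a small set of tokens, projects them into the language model's embedding dimension, and prepends them to the embedded question. Everything here (the BEVAdapter module, the 64-token budget, and the tensor shapes) is a hypothetical illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class BEVAdapter(nn.Module):
    """Project BEV feature maps into the LLM token-embedding space."""

    def __init__(self, bev_channels: int, llm_dim: int):
        super().__init__()
        # Pool the BEV grid down to 8x8 = 64 "BEV tokens" (an assumed budget).
        self.pool = nn.AdaptiveAvgPool2d(8)
        self.proj = nn.Linear(bev_channels, llm_dim)

    def forward(self, bev: torch.Tensor) -> torch.Tensor:
        # bev: (B, C, H, W) bird's-eye-view features from a multi-view encoder
        x = self.pool(bev)                # (B, C, 8, 8)
        x = x.flatten(2).transpose(1, 2)  # (B, 64, C)
        return self.proj(x)               # (B, 64, llm_dim)

def prepend_bev_tokens(bev_tokens: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
    """Concatenate BEV tokens in front of the embedded question tokens."""
    return torch.cat([bev_tokens, text_embeds], dim=1)

# Toy usage with random tensors standing in for real encoder outputs.
adapter = BEVAdapter(bev_channels=256, llm_dim=4096)
bev_features = torch.randn(1, 256, 200, 200)  # e.g. from a BEVFormer-style encoder
text_embeds = torch.randn(1, 32, 4096)        # embedded question tokens
llm_inputs = prepend_bev_tokens(adapter(bev_features), text_embeds)
print(llm_inputs.shape)                       # torch.Size([1, 96, 4096])
```

Prefixing visual tokens to the text sequence is a common pattern in multi-modal LLMs; the paper's actual fusion mechanism may differ and is documented in the released code.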