DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding

📅 2025-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the lack of systematic evaluation and modeling of multi-step visual reasoning in autonomous driving, this paper introduces DriveLMM-o1, a multi-step reasoning VQA benchmark tailored to driving tasks, comprising over 18k training and 4k test samples, each annotated with a step-by-step logical reasoning trace. The authors propose a fine-grained, stepwise evaluation framework that goes beyond conventional VQA metrics focused solely on final-answer accuracy, and introduce an interpretable large multimodal model fine-tuned on the reasoning dataset. On the DriveLMM-o1 benchmark, the model achieves a +7.49% gain in final answer accuracy and a 3.62% improvement in reasoning score, outperforming the previous best open-source model.
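
The core evaluation idea, scoring the reasoning trace in addition to the final answer, can be illustrated with a minimal sketch. The function names, the token-overlap step matcher, and the exact-match answer check below are hypothetical illustrations of the concept, not the paper's actual metric, which the authors implement as a fine-grained framework on DriveLMM-o1.

```python
# Minimal sketch of a stepwise VQA evaluation: score the reasoning trace
# alongside the final answer. The token-overlap matcher and exact-match
# answer check are illustrative assumptions, not DriveLMM-o1's own metric.

def step_overlap(pred_step: str, ref_step: str) -> float:
    """Crude Jaccard similarity between one predicted and one reference reasoning step."""
    pred_tokens = set(pred_step.lower().split())
    ref_tokens = set(ref_step.lower().split())
    if not pred_tokens or not ref_tokens:
        return 0.0
    return len(pred_tokens & ref_tokens) / len(pred_tokens | ref_tokens)


def reasoning_score(pred_steps: list[str], ref_steps: list[str]) -> float:
    """Average best-match similarity of each reference step against the predicted steps."""
    if not ref_steps:
        return 0.0
    return sum(
        max((step_overlap(p, r) for p in pred_steps), default=0.0) for r in ref_steps
    ) / len(ref_steps)


def evaluate_sample(pred_answer: str, ref_answer: str,
                    pred_steps: list[str], ref_steps: list[str]) -> dict:
    """Report final-answer correctness and reasoning-trace quality as separate scores."""
    return {
        "answer_correct": float(pred_answer.strip().lower() == ref_answer.strip().lower()),
        "reasoning_score": reasoning_score(pred_steps, ref_steps),
    }


if __name__ == "__main__":
    # Toy driving example: the answer is right, but the trace only partially
    # covers the reference perception/prediction/planning steps.
    result = evaluate_sample(
        pred_answer="Yield to the pedestrian",
        ref_answer="Yield to the pedestrian",
        pred_steps=["A pedestrian is crossing ahead",
                    "The ego vehicle should slow down"],
        ref_steps=["Detect the pedestrian in the crosswalk",
                   "Predict the pedestrian will cross",
                   "Plan to slow down and yield"],
    )
    print(result)
```

Keeping the two scores separate is the point of the sketch: a model can reach the correct final answer with an unfaithful or incomplete reasoning trace, which a single answer-accuracy number would hide.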

📝 Abstract
While large multimodal models (LMMs) have demonstrated strong performance across various Visual Question Answering (VQA) tasks, certain challenges require complex multi-step reasoning to reach accurate answers. One particularly challenging task is autonomous driving, which demands thorough cognitive processing before decisions can be made. In this domain, a sequential and interpretive understanding of visual cues is essential for effective perception, prediction, and planning. Nevertheless, common VQA benchmarks often focus on the accuracy of the final answer while overlooking the reasoning process that enables the generation of accurate responses. Moreover, existing methods lack a comprehensive framework for evaluating step-by-step reasoning in realistic driving scenarios. To address this gap, we propose DriveLMM-o1, a new dataset and benchmark specifically designed to advance step-wise visual reasoning for autonomous driving. Our benchmark features over 18k VQA examples in the training set and more than 4k in the test set, covering diverse questions on perception, prediction, and planning, each enriched with step-by-step reasoning to ensure logical inference in autonomous driving scenarios. We further introduce a large multimodal model that is fine-tuned on our reasoning dataset, demonstrating robust performance in complex driving scenarios. In addition, we benchmark various open-source and closed-source methods on our proposed dataset, systematically comparing their reasoning capabilities for autonomous driving tasks. Our model achieves a +7.49% gain in final answer accuracy, along with a 3.62% improvement in reasoning score over the previous best open-source model. Our framework, dataset, and model are available at https://github.com/ayesha-ishaq/DriveLMM-o1.
Problem

Research questions and friction points this paper is trying to address.

Common VQA benchmarks focus on final-answer accuracy and overlook the reasoning process that leads to the answer.
Existing methods lack a comprehensive framework for evaluating step-by-step reasoning in realistic driving scenarios.
Autonomous driving decisions require sequential, interpretable understanding of visual cues for perception, prediction, and planning.
Innovation

Methods, ideas, or system contributions that make the work stand out.

DriveLMM-o1 dataset and benchmark with step-by-step reasoning annotations (18k+ training and 4k+ test VQA samples)
Large multimodal model fine-tuned on the reasoning dataset for complex driving scenarios
Systematic benchmarking of open-source and closed-source models on autonomous driving reasoning tasks