MindDriver: Introducing Progressive Multimodal Reasoning for Autonomous Driving

📅 2026-02-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses key challenges in applying chain-of-thought reasoning with vision-language models to autonomous driving, particularly the misalignment between textual semantic space and physical trajectory space, as well as the absence of goal-oriented guidance for scene evolution. To bridge this gap, the authors propose a progressive multimodal reasoning framework that emulates human-like staged cognition—sequentially performing semantic understanding, imagination-based mapping from semantics to physical space, and trajectory planning. The framework introduces a novel progressive reasoning mechanism, augmented by feedback-guided automatic multimodal data annotation and a progressive reinforcement fine-tuning strategy, which jointly align semantic comprehension with physical planning. The method achieves state-of-the-art performance on both nuScenes open-loop and Bench2Drive closed-loop evaluation benchmarks.

📝 Abstract
Vision-Language Models (VLMs) exhibit strong reasoning capabilities, showing promise for end-to-end autonomous driving systems. Chain-of-Thought (CoT), a widely used reasoning strategy for VLMs, faces critical challenges in this setting. Existing textual CoT leaves a large gap between the textual semantic space and the physical trajectory space. Although recent approaches use future images in place of text for the CoT process, they lack clear planning-oriented guidance for generating images with accurate scene evolution. To address these issues, we propose MindDriver, a progressive multimodal reasoning framework that enables a VLM to imitate human-like progressive thinking for autonomous driving. MindDriver sequentially performs semantic understanding, semantic-to-physical-space imagination, and physical-space trajectory planning. To achieve aligned reasoning processes in MindDriver, we develop a feedback-guided automatic data annotation pipeline that generates aligned multimodal reasoning training data. Furthermore, we develop a progressive reinforcement fine-tuning method that optimizes this alignment through progressive high-level reward-based learning. MindDriver demonstrates superior performance in both nuScenes open-loop and Bench2Drive closed-loop evaluation. Code is available at https://github.com/hotdogcheesewhite/MindDriver.
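The three-stage progressive reasoning loop described in the abstract (semantic understanding → semantic-to-physical imagination → trajectory planning) can be sketched as a minimal pipeline. This is an illustrative sketch only: the function and field names (`ReasoningState`, `progressive_reason`, the string-stub stage outputs) are hypothetical and not the authors' actual API; in the paper, stage 2 produces imagined future scene imagery and stage 3 produces a physical trajectory.

```python
from dataclasses import dataclass

@dataclass
class ReasoningState:
    """Holds the output of each reasoning stage (hypothetical structure)."""
    semantics: str                      # stage 1: textual scene understanding
    imagined_scene: str                 # stage 2: predicted future scene
    trajectory: list[tuple[float, float]]  # stage 3: planned (x, y) waypoints

def progressive_reason(observation: str) -> ReasoningState:
    # Stage 1: semantic understanding of the current multimodal input
    # (stubbed here as a string transform).
    semantics = f"semantics({observation})"
    # Stage 2: semantic-to-physical imagination, conditioned on stage 1's
    # output rather than on the raw observation alone.
    imagined_scene = f"future_scene({semantics})"
    # Stage 3: trajectory planning in physical space, conditioned on the
    # imagined scene. Placeholder waypoints stand in for a planner.
    trajectory = [(0.0, 0.0), (1.0, 0.2), (2.0, 0.5)]
    return ReasoningState(semantics, imagined_scene, trajectory)
```

The key structural point the sketch captures is the strict chaining: each stage consumes the previous stage's output, which is what the paper's annotation pipeline and progressive reinforcement fine-tuning are designed to keep aligned.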
Problem

Research questions and friction points this paper is trying to address.

autonomous driving
vision-language models
chain-of-thought
multimodal reasoning
trajectory planning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive Multimodal Reasoning
Vision-Language Models
Chain-of-Thought
Autonomous Driving
Reinforcement Fine-tuning