🤖 AI Summary
To address key bottlenecks of multimodal large language models (MLLMs) in closed-loop end-to-end autonomous driving, including poor generalization, opaque decision-making, and misalignment with motion planning, this paper proposes ReasonPlan, a unified fine-tuning framework that integrates scene prediction and decision reasoning. Methodologically, it introduces a dual-path reasoning mechanism that jointly leverages self-supervised next-scene prediction and supervised chain-of-thought (CoT) decision reasoning; constructs PDR, the first planning-oriented decision reasoning dataset (210k samples); and conducts MLLM fine-tuning and knowledge distillation. The core contribution lies in aligning visual representations with executable driving semantics, enabling causally interpretable decisions and strong zero-shot generalization. On Bench2Drive, experiments show a 19% reduction in L2 trajectory error and a 16.1-point improvement in driving score over the mainstream end-to-end imitation learning baseline. Moreover, zero-shot transfer to the DOS benchmark achieves state-of-the-art performance.
📝 Abstract
Owing to their powerful vision-language reasoning and generalization abilities, multimodal large language models (MLLMs) have garnered significant attention in the field of end-to-end (E2E) autonomous driving. However, their application to closed-loop systems remains underexplored, and current MLLM-based methods have not shown clear superiority over mainstream E2E imitation learning approaches. In this work, we propose ReasonPlan, a novel MLLM fine-tuning framework designed for closed-loop driving through holistic reasoning with a self-supervised Next Scene Prediction task and a supervised Decision Chain-of-Thought process. This dual mechanism encourages the model to align visual representations with actionable driving context, while promoting interpretable and causally grounded decision making. We curate a planning-oriented decision reasoning dataset, namely PDR, comprising 210k diverse and high-quality samples. Our method outperforms the mainstream E2E imitation learning method by a large margin on the Bench2Drive benchmark, with a 19% lower L2 error and a 16.1-point higher driving score. Furthermore, ReasonPlan demonstrates strong zero-shot generalization on the unseen DOS benchmark, highlighting its adaptability in handling zero-shot corner cases. Code and dataset will be available at https://github.com/Liuxueyi/ReasonPlan.
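For readers unfamiliar with the "19% L2" figure, the following is a minimal sketch of how an L2 trajectory error and a relative reduction are commonly computed. It is an illustration under assumptions, not the benchmark's or the authors' evaluation code: the function names, the (x, y) waypoint format, and the toy trajectories are all hypothetical, and the actual horizon and sampling rate used by Bench2Drive are not stated in the abstract.

```python
import math

def mean_l2_error(pred, gt):
    """Mean Euclidean distance between matched (x, y) waypoints.

    Assumption: both trajectories are sampled at the same timestamps.
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)

def relative_reduction(baseline_error, new_error):
    """Fractional error reduction relative to a baseline (0.19 means 19%)."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error

# Toy waypoints in metres (hypothetical values, not data from the paper).
gt_waypoints = [(0.0, 1.0), (0.0, 2.0), (0.0, 3.0)]
pred_waypoints = [(0.3, 1.4), (0.0, 2.0), (0.6, 3.8)]

error = mean_l2_error(pred_waypoints, gt_waypoints)
print(f"mean L2 error: {error:.3f} m")
```

Under this definition, a "19% L2" improvement would mean that `relative_reduction(baseline_error, new_error)` is about 0.19. The driving score, by contrast, is a closed-loop metric and cannot be derived from waypoint distances alone.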