MindDrive: An All-in-One Framework Bridging World Models and Vision-Language Model for End-to-End Autonomous Driving

📅 2025-12-03
🤖 AI Summary
Existing end-to-end autonomous driving methods separate trajectory generation from decision evaluation: generation-oriented approaches rely on simple decision mechanisms, while selection-oriented approaches are limited by the quality of their candidate trajectories. MindDrive addresses this by integrating a world model with a vision-language model (VLM) in a structured reasoning paradigm of context simulation, candidate generation, and multi-objective trade-off. Its core components are: (i) the Future-aware Trajectory Generator (FaTG), which builds on a World Action Model (WaM) to run ego-conditioned "what-if" simulations and generate foresighted trajectory candidates; and (ii) the VLM-oriented Evaluator (VLoE), which uses the reasoning capability of a large VLM to evaluate candidates across safety, comfort, and efficiency. On the NAVSIM-v1 and NAVSIM-v2 benchmarks, MindDrive reports state-of-the-art performance, with improvements in safety, compliance, and generalization.

📝 Abstract
End-to-end autonomous driving (E2E-AD) has emerged as a new paradigm, in which trajectory planning plays a crucial role. Existing studies mainly follow two directions: generation-oriented methods, which focus on producing high-quality trajectories with simple decision mechanisms, and selection-oriented methods, which perform multi-dimensional evaluation to select the best trajectory yet lack sufficient generative capability. In this work, we propose MindDrive, a harmonized framework that integrates high-quality trajectory generation with comprehensive decision reasoning. It establishes a structured reasoning paradigm of "context simulation - candidate generation - multi-objective trade-off". In particular, the proposed Future-aware Trajectory Generator (FaTG), based on a World Action Model (WaM), performs ego-conditioned "what-if" simulations to predict potential future scenes and generate foresighted trajectory candidates. Building upon this, the VLM-oriented Evaluator (VLoE) leverages the reasoning capability of a large vision-language model to conduct multi-objective evaluations across the safety, comfort, and efficiency dimensions, leading to reasoned and human-aligned decision making. Extensive experiments on the NAVSIM-v1 and NAVSIM-v2 benchmarks demonstrate that MindDrive achieves state-of-the-art performance across multi-dimensional driving metrics, significantly enhancing safety, compliance, and generalization. This work provides a promising path toward interpretable and cognitively guided autonomous driving.
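The ego-conditioned "what-if" simulation described in the abstract can be illustrated with a deliberately small sketch. It is not the paper's World Action Model: the learned model is replaced here by a toy one-dimensional kinematic rollout, and the names `State`, `rollout`, and `generate_candidates`, the constant-speed lead vehicle, and the set of candidate accelerations are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    x: float  # longitudinal position in metres
    v: float  # speed in metres per second


def rollout(ego, lead, accel, steps=8, dt=0.5):
    """Toy stand-in for a world model: simulate the scene under one ego action.

    Assumption: the lead vehicle keeps a constant speed and the ego applies a
    constant acceleration. Returns the ego trajectory and the smallest gap
    to the lead vehicle observed during the rollout.
    """
    trajectory = []
    min_gap = lead.x - ego.x
    for _ in range(steps):
        v_next = max(0.0, ego.v + accel * dt)  # the ego cannot reverse
        ego = State(ego.x + 0.5 * (ego.v + v_next) * dt, v_next)
        lead = State(lead.x + lead.v * dt, lead.v)
        trajectory.append(ego)
        min_gap = min(min_gap, lead.x - ego.x)
    return trajectory, min_gap


def generate_candidates(ego, lead, accels=(-2.0, 0.0, 1.0)):
    """Run one what-if rollout per hypothetical ego action."""
    return {a: rollout(ego, lead, a) for a in accels}


# Hypothetical scene: ego at 10 m/s, slower lead vehicle 30 m ahead.
candidates = generate_candidates(State(0.0, 10.0), State(30.0, 8.0))
for accel, (trajectory, min_gap) in candidates.items():
    print(f"accel={accel:+.1f} m/s^2  final x={trajectory[-1].x:.1f} m  min gap={min_gap:.1f} m")
```

Each candidate carries both a trajectory and a predicted consequence (here, the minimum gap), which is the kind of foresight a downstream evaluator can then weigh.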
Problem

Research questions and friction points this paper is trying to address.

Generation-oriented methods produce high-quality trajectories but rely on simple decision mechanisms
Selection-oriented methods evaluate trajectories along multiple dimensions but lack sufficient generative capability
High-quality trajectory generation and comprehensive decision reasoning need to be combined in a single framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

World Action Model for ego-conditioned future simulation
Vision-language model for multi-objective trajectory evaluation
Structured reasoning paradigm integrating generation and selection
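The multi-objective trade-off step can be sketched as follows. This is only one plausible reading of the idea: the page does not say how VLoE combines its assessments, so the weighted sum, the safety floor, the weights, and the per-candidate scores below are all assumptions, with hand-written numbers standing in for what a vision-language model would produce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    safety: float      # all three scores are assumed to lie in [0, 1]
    comfort: float
    efficiency: float


def select(candidates, weights=(0.5, 0.2, 0.3), safety_floor=0.5):
    """Pick a candidate by weighted sum after rejecting clearly unsafe ones.

    Assumption: safety acts as a hard floor before the trade-off is applied.
    If every candidate falls below the floor, fall back to the safest one.
    """
    admissible = {k: s for k, s in candidates.items() if s.safety >= safety_floor}
    if not admissible:
        return max(candidates, key=lambda k: candidates[k].safety)

    w_safety, w_comfort, w_efficiency = weights

    def total(s):
        return w_safety * s.safety + w_comfort * s.comfort + w_efficiency * s.efficiency

    return max(admissible, key=lambda k: total(admissible[k]))


# Hypothetical scores for three trajectory candidates.
scored = {
    "brake": Scores(safety=0.95, comfort=0.6, efficiency=0.3),
    "keep": Scores(safety=0.8, comfort=0.9, efficiency=0.6),
    "accelerate": Scores(safety=0.3, comfort=0.7, efficiency=0.9),
}
print(select(scored))
```

Treating safety as a gate rather than just another weighted term keeps a highly efficient but unsafe candidate from winning on the sum alone; whether MindDrive does this is not stated here.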
Bin Sun
School of Transportation Science and Engineering, Beihang University
Yaoguang Cao
State Key Laboratory of Intelligent Transportation System, Beihang University; Hangzhou International Innovation Institute, Beihang University
Yan Wang
School of Transportation Science and Engineering, Beihang University
Rui Wang
School of Transportation Science and Engineering, Beihang University
Jiachen Shang
School of Transportation Science and Engineering, Beihang University
Xiejie Feng
School of Transportation Science and Engineering, Beihang University
Jiayi Lu
Beihang University
Autonomous Vehicle, Computer Vision, SOTIF, ADAS
Jia Shi
China Automotive Engineering Research Institute Co., Ltd.
Shichun Yang
School of Transportation Science and Engineering, Beihang University
Xiaoyu Yan
Research Institute of Aero-Engine, Beihang University
Ziying Song
Beijing Jiaotong University
Object Detection, Computer Vision, Deep Learning