🤖 AI Summary
Existing end-to-end autonomous driving methods suffer from a fundamental disconnect between trajectory generation and decision evaluation: generative approaches lack multi-objective reasoning, while selection-based methods are constrained by the quality of their candidate trajectories. To address this, we propose MindDrive, the first framework to synergistically integrate a world model with a vision-language model (VLM), establishing a cognition-driven paradigm of *context simulation*, *candidate generation*, and *multi-objective trade-off*. Its core components are: (i) the Future-aware Trajectory Generator (FaTG), a module built on a World Action Model (WaM) that enables ego-conditioned, high-fidelity trajectory synthesis; and (ii) the VLM-oriented Evaluator (VLoE), which provides interpretable, structured assessment across safety, comfort, and efficiency. Evaluated on the NAVSIM-v1 and NAVSIM-v2 benchmarks, MindDrive achieves state-of-the-art performance, significantly improving safety, regulatory compliance, and cross-scenario generalization, demonstrating the efficacy of cognition-guided driving.
📝 Abstract
End-to-end autonomous driving (E2E-AD) has emerged as a new paradigm in which trajectory planning plays a crucial role. Existing studies mainly follow two directions: trajectory-generation-oriented methods, which produce high-quality trajectories but rely on simple decision mechanisms, and trajectory-selection-oriented methods, which perform multi-dimensional evaluation to select the best trajectory yet lack sufficient generative capability. In this work, we propose MindDrive, a harmonized framework that integrates high-quality trajectory generation with comprehensive decision reasoning. It establishes a structured reasoning paradigm of "context simulation - candidate generation - multi-objective trade-off". In particular, the proposed Future-aware Trajectory Generator (FaTG), built on a World Action Model (WaM), performs ego-conditioned "what-if" simulations to predict potential future scenes and generate foresighted trajectory candidates. Building on this, the VLM-oriented Evaluator (VLoE) leverages the reasoning capability of a large vision-language model to conduct multi-objective evaluations across the safety, comfort, and efficiency dimensions, leading to reasoned, human-aligned decision making. Extensive experiments on the NAVSIM-v1 and NAVSIM-v2 benchmarks demonstrate that MindDrive achieves state-of-the-art performance across multi-dimensional driving metrics, significantly enhancing safety, compliance, and generalization. This work offers a promising path toward interpretable, cognitively guided autonomous driving.
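To make the "candidate generation - multi-objective trade-off" paradigm concrete, the following is a minimal sketch of such a decision loop: a set of candidate trajectories is scored along safety, comfort, and efficiency, and the best weighted trade-off is selected. All names (`Candidate`, `select_trajectory`) and the scores/weights are illustrative assumptions for exposition, not the paper's actual API or the VLoE scoring scheme.

```python
from dataclasses import dataclass

# Hypothetical illustration of a multi-objective trade-off over trajectory
# candidates; scores and weights are made up for this example.

@dataclass
class Candidate:
    trajectory_id: int
    safety: float      # e.g., collision avoidance / rule compliance, in [0, 1]
    comfort: float     # e.g., smoothness of acceleration and steering, in [0, 1]
    efficiency: float  # e.g., progress toward the route goal, in [0, 1]

def select_trajectory(candidates, weights=(0.5, 0.2, 0.3)):
    """Return the candidate maximizing a weighted multi-objective score."""
    w_safety, w_comfort, w_efficiency = weights
    def score(c):
        return (w_safety * c.safety
                + w_comfort * c.comfort
                + w_efficiency * c.efficiency)
    return max(candidates, key=score)

# Three dummy candidates standing in for a generator's output.
candidates = [
    Candidate(0, safety=0.90, comfort=0.6, efficiency=0.5),
    Candidate(1, safety=0.70, comfort=0.9, efficiency=0.9),
    Candidate(2, safety=0.95, comfort=0.5, efficiency=0.4),
]
best = select_trajectory(candidates)
print(best.trajectory_id)  # → 1 (best overall trade-off under these weights)
```

In MindDrive the per-dimension assessments come from VLM reasoning rather than fixed numeric weights, which is what makes the resulting decisions interpretable and human-aligned.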