MindDrive: An All-in-One Framework Bridging World Models and Vision-Language Model for End-to-End Autonomous Driving

📅 2025-12-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing end-to-end autonomous driving methods split into two camps that each leave a gap: generation-oriented approaches produce high-quality trajectories but rely on simple decision mechanisms, while selection-oriented approaches evaluate trajectories along multiple dimensions but are limited by the quality of their candidates. MindDrive addresses this by integrating a world model with a vision-language model (VLM) in a structured reasoning paradigm of *context simulation*, *candidate generation*, and *multi-objective trade-off*. Its core components are: (i) the Future-aware Trajectory Generator (FaTG), built on a World Action Model (WaM), which runs ego-conditioned "what-if" simulations to predict potential future scenes and generate foresighted trajectory candidates; and (ii) the VLM-oriented Evaluator (VLoE), which uses the reasoning capability of a large VLM to assess candidates across safety, comfort, and efficiency. On the NAVSIM-v1 and NAVSIM-v2 benchmarks, MindDrive reports state-of-the-art performance across multi-dimensional driving metrics, with improved safety, compliance, and generalization.

📝 Abstract

End-to-end autonomous driving (E2E-AD) has emerged as a new paradigm, where trajectory planning plays a crucial role. Existing studies mainly follow two directions: trajectory-generation-oriented methods, which focus on producing high-quality trajectories with simple decision mechanisms, and trajectory-selection-oriented methods, which perform multi-dimensional evaluation to select the best trajectory yet lack sufficient generative capability. In this work, we propose MindDrive, a harmonized framework that integrates high-quality trajectory generation with comprehensive decision reasoning. It establishes a structured reasoning paradigm of "context simulation - candidate generation - multi-objective trade-off". In particular, the proposed Future-aware Trajectory Generator (FaTG), based on a World Action Model (WaM), performs ego-conditioned "what-if" simulations to predict potential future scenes and generate foresighted trajectory candidates. Building upon this, the VLM-oriented Evaluator (VLoE) leverages the reasoning capability of a large vision-language model to conduct multi-objective evaluations across safety, comfort, and efficiency dimensions, leading to reasoned and human-aligned decision making. Extensive experiments on the NAVSIM-v1 and NAVSIM-v2 benchmarks demonstrate that MindDrive achieves state-of-the-art performance across multi-dimensional driving metrics, significantly enhancing safety, compliance, and generalization. This work provides a promising path toward interpretable and cognitively guided autonomous driving.
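To make the "multi-objective trade-off" step concrete, here is a minimal Python sketch of selecting one trajectory from per-candidate safety, comfort, and efficiency scores. It is not the paper's method: the abstract does not say how VLoE combines its assessments, so the weighted sum, the `Scores` class, the `select_trajectory` function, the weights, and the example scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    """Hypothetical per-candidate scores in [0, 1]; higher is better."""
    safety: float
    comfort: float
    efficiency: float


def select_trajectory(candidates, weights=(0.5, 0.2, 0.3)):
    """Return (best_index, totals) under an assumed weighted-sum trade-off.

    The weights are illustrative only; the source does not specify them.
    """
    w_safety, w_comfort, w_efficiency = weights
    totals = [
        w_safety * c.safety + w_comfort * c.comfort + w_efficiency * c.efficiency
        for c in candidates
    ]
    best_index = max(range(len(totals)), key=totals.__getitem__)
    return best_index, totals


# Made-up scores for three candidate trajectories.
candidates = [
    Scores(safety=0.9, comfort=0.5, efficiency=0.4),  # cautious
    Scores(safety=0.6, comfort=0.9, efficiency=0.9),  # balanced
    Scores(safety=0.2, comfort=1.0, efficiency=1.0),  # aggressive
]
best_index, totals = select_trajectory(candidates)
print(best_index)
```

With these made-up numbers the balanced candidate wins: the aggressive one scores best on comfort and efficiency but is penalized by the larger safety weight. A language-model evaluator could weigh the objectives contextually rather than with fixed coefficients, which a fixed-weight sum cannot do.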

Problem

Research questions and friction points this paper is trying to address.

Generation-oriented methods produce high-quality trajectories but rely on simple decision mechanisms

Selection-oriented methods evaluate trajectories along multiple dimensions but lack sufficient generative capability

No single framework combines high-quality trajectory generation with comprehensive decision reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

World Action Model for ego-conditioned future simulation

Vision-language model for multi-objective trajectory evaluation

Structured reasoning paradigm integrating generation and selection
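The idea of ego-conditioned "what-if" simulation can be illustrated with a toy Python example. In MindDrive the future scenes come from a learned World Action Model; the sketch below swaps in simple constant-acceleration kinematics on a single lane purely for illustration. The function names, the lead-vehicle setup, the candidate accelerations, and the 8 m gap threshold are all assumptions, not details from the paper.

```python
def rollout(accel, ego_x=0.0, ego_v=10.0, lead_x=20.0, lead_v=5.0,
            steps=4, dt=0.5):
    """Simulate one hypothetical ego action against a lead vehicle (1-D lane).

    Returns (final ego position, smallest gap to the lead vehicle).
    Toy kinematics stand in for a learned world model.
    """
    min_gap = float("inf")
    for _ in range(steps):
        ego_v = max(0.0, ego_v + accel * dt)  # ego speed cannot go negative
        ego_x += ego_v * dt
        lead_x += lead_v * dt                 # lead keeps constant speed
        min_gap = min(min_gap, lead_x - ego_x)
    return ego_x, min_gap


def what_if_candidates(accels):
    """Roll out each candidate ego acceleration and record its outcome."""
    results = []
    for a in accels:
        progress, min_gap = rollout(a)
        results.append({"accel": a, "progress": progress, "min_gap": min_gap})
    return results


candidates = what_if_candidates([-2.0, 0.0, 2.0])

# Assumed selection rule: keep candidates whose gap never drops below 8 m,
# then prefer the one that makes the most progress.
safe = [c for c in candidates if c["min_gap"] >= 8.0]
chosen = max(safe, key=lambda c: c["progress"])
print(chosen)
```

Each candidate action is conditioned on the ego vehicle's own choice, and the simulated future (here, the shrinking gap to the lead vehicle) is what makes the candidate assessable before it is executed. In this toy setup accelerating closes the gap to 5 m and is filtered out, so the constant-speed candidate is chosen.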

🔎 Similar Papers

MiniDrive: More Efficient Vision-Language Models with Multi-Level 2D Features as Text Tokens for Autonomous Driving