🤖 AI Summary
This work addresses a limitation of existing autonomous driving trajectory planning methods: they rely on single-turn reasoning and struggle with complex, long-tail scenarios that require iterative refinement. To overcome this, the authors propose MTDrive, a multi-turn framework that enables multimodal large language models to iteratively refine driving trajectories based on environmental feedback. The key contributions are the Multi-Turn Group Relative Policy Optimization (mtGRPO) algorithm, which mitigates reward sparsity by computing relative advantages across turns; an interactive trajectory understanding dataset built from closed-loop simulation to support multi-turn training; and system-level optimizations that reduce the data transfer overhead caused by high-resolution images and multi-turn sequences, yielding 2.5× training throughput. On the NAVSIM benchmark, MTDrive outperforms existing methods, supporting the effectiveness of multi-turn reasoning for autonomous driving planning.
📝 Abstract
Trajectory planning is a core task in autonomous driving, requiring the prediction of safe and comfortable paths across diverse scenarios. Integrating Multi-modal Large Language Models (MLLMs) with Reinforcement Learning (RL) has shown promise in addressing "long-tail" scenarios. However, existing methods are constrained to single-turn reasoning, which limits their ability to handle complex tasks requiring iterative refinement. To overcome this limitation, we present MTDrive, a multi-turn framework that enables MLLMs to iteratively refine trajectories based on environmental feedback. MTDrive introduces Multi-Turn Group Relative Policy Optimization (mtGRPO), which mitigates reward sparsity by computing relative advantages across turns. We further construct an interactive trajectory understanding dataset from closed-loop simulation to support multi-turn training. Experiments on the NAVSIM benchmark demonstrate superior performance compared to existing methods, validating the effectiveness of our multi-turn reasoning paradigm. Additionally, we implement system-level optimizations to reduce the data transfer overhead caused by high-resolution images and multi-turn sequences, achieving 2.5× training throughput. Our data, models, and code will be made available soon.
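The abstract does not give the mtGRPO formula, so the following is only a minimal sketch of one plausible reading of "relative advantages across turns": each turn's reward is normalized against the same turn of the other rollouts in a sampled group, so every turn receives its own learning signal instead of a single end-of-episode score. The function name `turn_level_group_advantages`, the `rewards[g][t]` layout, the `eps` constant, and the sample rewards are all illustrative assumptions, not the authors' implementation.

```python
from statistics import mean, pstdev


def turn_level_group_advantages(rewards, eps=1e-6):
    """Normalize rewards within a rollout group, separately for each turn.

    rewards[g][t] is the reward of rollout g at turn t (assumed layout).
    Returns advantages with the same shape.
    """
    num_turns = len(rewards[0])
    if any(len(rollout) != num_turns for rollout in rewards):
        raise ValueError("all rollouts must have the same number of turns")

    advantages = [[0.0] * num_turns for _ in rewards]
    for t in range(num_turns):
        # Rewards that every rollout in the group obtained at turn t.
        column = [rollout[t] for rollout in rewards]
        mu = mean(column)
        sigma = pstdev(column)  # population standard deviation of the group
        for g, value in enumerate(column):
            # Group-relative advantage: positive if this rollout beat the
            # group average at this turn, negative otherwise.
            advantages[g][t] = (value - mu) / (sigma + eps)
    return advantages


# Hypothetical group of four rollouts, two refinement turns each.
group_rewards = [
    [0.2, 0.9],  # weak first trajectory, strong refinement
    [0.2, 0.5],
    [0.6, 0.5],
    [0.6, 0.9],
]
adv = turn_level_group_advantages(group_rewards)
```

In this toy group, the first rollout gets a negative advantage for its initial trajectory and a positive one for its refinement, which a single per-rollout advantage computed from a final reward alone could not express.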