🤖 AI Summary
Current vision-language models (VLMs) hallucinate and generalize poorly under out-of-distribution (OOD) conditions, particularly in complex, dynamic driving scenarios, which hinders their real-world deployment in end-to-end autonomous driving. To address this, we propose MTRDrive, a memory-tool collaborative closed-loop reasoning framework: experience-based memory retrieval enhances contextual awareness, while dynamic tool invocation enables proactive reasoning and decision-making in long-tail situations (e.g., road construction). We further introduce Roadwork-VLM, the first benchmark specifically designed to evaluate VLMs on construction-related driving scenarios. On NAVSIM, our method achieves a PDMS of 88.3 and, on high-level planning, a driving metric score of 79.8% and a planning accuracy of 82.6%. Under zero-shot evaluation on Roadwork-VLM, it attains a driving metric score of 80.2%, indicating improved robustness and generalization of VLMs in complex, dynamic environments.
📝 Abstract
Vision-Language Models (VLMs) have demonstrated significant potential for end-to-end autonomous driving, yet a substantial gap remains between their current capabilities and the reliability required for real-world deployment. A critical challenge is their fragility, characterized by hallucinations and poor generalization in out-of-distribution (OOD) scenarios. To bridge this gap, we introduce MTRDrive, a novel framework that integrates procedural driving experiences with a dynamic toolkit to enhance generalization and proactive decision-making.
MTRDrive addresses these limitations through a closed-loop system that combines a memory-based experience retrieval mechanism with dynamic toolkits. This memory-tool synergy enables the model to interact more effectively with its environment, improving both reasoning and decision-making. Additionally, we introduce a new benchmark based on complex roadwork construction scenarios to rigorously evaluate zero-shot generalization.
Extensive experiments demonstrate the effectiveness of our approach. On the public NAVSIM benchmark, our 3B-parameter MTRDrive model achieves a PDMS of 88.3 without chain-of-thought and sets a new state of the art on high-level planning, with a driving metric score of 79.8% and a planning accuracy of 82.6%. Zero-shot evaluation on the new Roadwork-VLM benchmark shows that the model reasons robustly in unseen scenarios, achieving a driving metric score of 80.2%. These results highlight MTRDrive's potential to advance autonomous driving toward safer and more reliable systems.
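
To make the memory-tool interaction described in the abstract more concrete, the following is a minimal toy sketch, not the authors' implementation. Every name in it (`Experience`, `retrieve`, `plan`, the tag-triggered tools, and the keyword rule standing in for the VLM's decision) is a hypothetical placeholder. It assumes that experiences are stored as embedding vectors and retrieved by cosine similarity, which the abstract does not specify.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Callable, Dict, List, Sequence, Set, Tuple


@dataclass
class Experience:
    """Hypothetical memory entry: a scene embedding plus a textual lesson."""
    embedding: Sequence[float]
    note: str


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(memory: List[Experience], query: Sequence[float], k: int = 1) -> List[Experience]:
    """Return the k stored experiences most similar to the query embedding."""
    ranked = sorted(memory, key=lambda e: cosine(e.embedding, query), reverse=True)
    return ranked[:k]


def plan(
    scene_embedding: Sequence[float],
    scene_tags: Set[str],
    memory: List[Experience],
    tools: Dict[str, Callable[[], str]],
) -> Tuple[str, List[str]]:
    """One reasoning step: recall experience, call tools, then decide."""
    # 1) Memory: gather notes from the most similar past experiences.
    context = [e.note for e in retrieve(memory, scene_embedding)]
    # 2) Tools: invoke only the tools whose trigger tag is present in the scene.
    for tag, tool in tools.items():
        if tag in scene_tags:
            context.append(tool())
    # 3) Decision: a toy keyword rule standing in for the VLM (assumption).
    cautious = any("roadwork" in c or "cone" in c for c in context)
    return ("slow_down" if cautious else "keep_lane"), context


if __name__ == "__main__":
    memory = [
        Experience([1.0, 0.0], "roadwork: lanes narrow, reduce speed"),
        Experience([0.0, 1.0], "clear highway: keep lane"),
    ]
    tools = {"roadwork": lambda: "detector: cones ahead"}
    print(plan([0.9, 0.1], {"roadwork"}, memory, tools))
```

In this sketch the retrieved notes and the tool outputs are merged into a single context before the decision is made, which is one simple way to read the "synergy" the abstract describes. A real system would replace the keyword rule with the model's own reasoning over that context.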