AI Summary
Existing autonomous driving planners predominantly rely on abstracted perception outputs or map data, neglecting fine-grained visual cues such as road surface texture, sudden obstacles, or accident aftermath, which undermines decision robustness in complex scenarios. To address this, we propose VLMPlanner: the first motion planning framework to deeply integrate multi-view vision-language models (VLMs) into end-to-end planning. It introduces a Context-Adaptive Inference Gate (CAI-Gate) that dynamically modulates VLM invocation frequency, emulating human drivers' "selective attention" to balance real-time performance and commonsense reasoning. Adopting a hybrid architecture that combines a learned planner with VLM guidance, VLMPlanner processes raw camera images directly. Evaluated on the nuPlan benchmark, it significantly outperforms state-of-the-art methods, achieving a 12.7% improvement in planning success rate under challenging conditions, including unstructured roads and high-density interactive scenarios, demonstrating the critical value of joint vision-language modeling for safe, interpretable decision-making.
Abstract
Integrating large language models (LLMs) into autonomous driving motion planning has recently emerged as a promising direction, offering enhanced interpretability, better controllability, and improved generalization in rare and long-tail scenarios. However, existing methods often rely on abstracted perception or map-based inputs, missing crucial visual context, such as fine-grained road cues, accident aftermath, or unexpected obstacles, which are essential for robust decision-making in complex driving environments. To bridge this gap, we propose VLMPlanner, a hybrid framework that combines a learning-based real-time planner with a vision-language model (VLM) capable of reasoning over raw images. The VLM processes multi-view images to capture rich, detailed visual information and leverages its common-sense reasoning capabilities to guide the real-time planner in generating robust and safe trajectories. Furthermore, we develop the Context-Adaptive Inference Gate (CAI-Gate) mechanism that enables the VLM to mimic human driving behavior by dynamically adjusting its inference frequency based on scene complexity, thereby achieving an optimal balance between planning performance and computational efficiency. We evaluate our approach on the large-scale, challenging nuPlan benchmark, with comprehensive experimental results demonstrating superior planning performance in scenarios with intricate road conditions and dynamic elements. Code will be available.
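The CAI-Gate idea, calling the expensive VLM only when the scene warrants it and otherwise reusing its last guidance, can be sketched roughly as follows. This is an illustrative sketch only: the class, the complexity heuristic, and all thresholds are assumptions for exposition, not the paper's actual mechanism.

```python
# Sketch of a context-adaptive inference gate in the spirit of CAI-Gate.
# All names, the complexity heuristic, and the thresholds are hypothetical.
from dataclasses import dataclass


@dataclass
class CAIGate:
    threshold: float = 0.5      # complexity above this triggers a fresh VLM call
    max_staleness: int = 10     # force a VLM call after this many skipped frames
    _staleness: int = 0
    _cached_guidance: str = "keep-lane"

    def scene_complexity(self, num_agents: int, min_gap_m: float) -> float:
        """Toy heuristic: more nearby agents and smaller gaps -> higher complexity."""
        agent_term = min(num_agents / 10.0, 1.0)
        gap_term = 1.0 - min(min_gap_m / 50.0, 1.0)
        return 0.5 * agent_term + 0.5 * gap_term

    def step(self, num_agents: int, min_gap_m: float, vlm_call) -> str:
        """Return guidance for the real-time planner, invoking the VLM only when needed."""
        c = self.scene_complexity(num_agents, min_gap_m)
        if c >= self.threshold or self._staleness >= self.max_staleness:
            self._cached_guidance = vlm_call()  # expensive multi-view VLM inference
            self._staleness = 0
        else:
            self._staleness += 1                # simple scene: reuse cached guidance
        return self._cached_guidance
```

In this toy setup, a sparse scene (two agents, a 40 m gap) scores below the threshold and reuses the cached guidance, while a dense scene (twelve agents, a 5 m gap) crosses it and triggers a fresh VLM call, mirroring the paper's stated goal of trading planning performance against computational cost.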