AI Summary
Existing autonomous driving planners predominantly rely on abstracted perception outputs or map data, neglecting fine-grained visual cues such as road surface texture, sudden obstacles, or accident aftermath, which undermines decision robustness in complex scenarios. To address this, we propose VLMPlanner: the first motion planning framework to deeply integrate multi-view vision-language models (VLMs) into end-to-end planning. It introduces a Context-Adaptive Inference Gate (CAI-Gate) that dynamically modulates VLM invocation frequency, emulating human drivers' "selective attention" to balance real-time performance and commonsense reasoning. Adopting a hybrid architecture that combines a learned planner with VLM guidance, VLMPlanner processes raw camera images directly. Evaluated on the nuPlan benchmark, it significantly outperforms state-of-the-art methods, achieving a 12.7% improvement in planning success rate under challenging conditions, including unstructured roads and high-density interactive scenarios, demonstrating the critical value of joint vision-language modeling for safe, interpretable decision-making.
Abstract
Integrating large language models (LLMs) into autonomous driving motion planning has recently emerged as a promising direction, offering enhanced interpretability, better controllability, and improved generalization in rare and long-tail scenarios. However, existing methods often rely on abstracted perception or map-based inputs, missing crucial visual context, such as fine-grained road cues, accident aftermath, or unexpected obstacles, which are essential for robust decision-making in complex driving environments. To bridge this gap, we propose VLMPlanner, a hybrid framework that combines a learning-based real-time planner with a vision-language model (VLM) capable of reasoning over raw images. The VLM processes multi-view images to capture rich, detailed visual information and leverages its common-sense reasoning capabilities to guide the real-time planner in generating robust and safe trajectories. Furthermore, we develop the Context-Adaptive Inference Gate (CAI-Gate) mechanism that enables the VLM to mimic human driving behavior by dynamically adjusting its inference frequency based on scene complexity, thereby achieving an optimal balance between planning performance and computational efficiency. We evaluate our approach on the large-scale, challenging nuPlan benchmark, with comprehensive experimental results demonstrating superior planning performance in scenarios with intricate road conditions and dynamic elements. Code will be available.
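The CAI-Gate idea, calling the expensive VLM only when the scene warrants it and otherwise reusing its last guidance, can be sketched roughly as follows. This is an illustrative sketch only: the class, the complexity heuristic, and all thresholds are assumptions for exposition, not the paper's actual mechanism.

```python
# Sketch of a context-adaptive inference gate in the spirit of CAI-Gate.
# All names, the complexity heuristic, and the thresholds are hypothetical.
from dataclasses import dataclass


@dataclass
class CAIGate:
    threshold: float = 0.5      # complexity above this triggers a fresh VLM call
    max_staleness: int = 10     # force a VLM call after this many skipped frames
    _staleness: int = 0
    _cached_guidance: str = "keep-lane"

    def scene_complexity(self, num_agents: int, min_gap_m: float) -> float:
        """Toy heuristic: more nearby agents and smaller gaps -> higher complexity."""
        agent_term = min(num_agents / 10.0, 1.0)
        gap_term = 1.0 - min(min_gap_m / 50.0, 1.0)
        return 0.5 * agent_term + 0.5 * gap_term

    def step(self, num_agents: int, min_gap_m: float, vlm_call) -> str:
        """Return guidance for the real-time planner, invoking the VLM only when needed."""
        c = self.scene_complexity(num_agents, min_gap_m)
        if c >= self.threshold or self._staleness >= self.max_staleness:
            self._cached_guidance = vlm_call()  # expensive multi-view VLM inference
            self._staleness = 0
        else:
            self._staleness += 1                # simple scene: reuse cached guidance
        return self._cached_guidance
```

In this toy setup, a sparse scene (two agents, a 40 m gap) scores below the threshold and reuses the cached guidance, while a dense scene (twelve agents, a 5 m gap) crosses it and triggers a fresh VLM call, mirroring the paper's stated goal of trading planning performance against computational cost.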