🤖 AI Summary
Current vision-language models (VLMs) suffer from heavy reliance on handcrafted prompts, poor robustness, and limited contextual awareness when deployed for high-level autonomous driving decision-making. To address these limitations, this work reframes VLMs not as direct decision-makers but as semantic enhancers. We propose a multimodal interaction architecture that fuses ego-view visual features with structured scene descriptions, employ a visual question-answering framework to guide fine-grained semantic alignment, and integrate a post-processing language model to improve decision reliability. Our core innovations lie in decoupling perception from decision-making, enhancing semantic interpretability, and enabling context-adaptive reasoning. Evaluated on two mainstream autonomous driving benchmarks, our approach achieves state-of-the-art performance, significantly improving decision accuracy and cross-scenario generalization while generating human-readable textual justifications for each decision.
📝 Abstract
Recent advancements in autonomous driving (AD) have explored the use of vision-language models (VLMs) within visual question answering (VQA) frameworks for direct driving decision-making. However, these approaches often depend on handcrafted prompts and suffer from inconsistent performance, limiting their robustness and generalization in real-world scenarios. In this work, we evaluate state-of-the-art open-source VLMs on high-level decision-making tasks using ego-view visual inputs and identify critical limitations in their ability to deliver reliable, context-aware decisions. Motivated by these observations, we propose a new approach that shifts the role of VLMs from direct decision generators to semantic enhancers. Specifically, we leverage their strong general scene understanding to enrich existing vision-based benchmarks with structured, linguistically rich scene descriptions. Building on this enriched representation, we introduce a multimodal interactive architecture that fuses visual and linguistic features for more accurate decision-making and interpretable textual explanations. Furthermore, we design a post-hoc refinement module that uses VLMs to enhance prediction reliability. Extensive experiments on two autonomous driving benchmarks demonstrate that our approach achieves state-of-the-art performance, offering a promising direction for integrating VLMs into reliable and interpretable AD systems.