VLMs Guided Interpretable Decision Making for Autonomous Driving

📅 2025-11-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models (VLMs) suffer from heavy reliance on handcrafted prompts, poor robustness, and limited contextual awareness when deployed for high-level autonomous driving decision-making. To address these limitations, this work reframes VLMs not as direct decision-makers but as semantic enhancers. We propose a multimodal interaction architecture that fuses ego-view visual features with structured scene descriptions, employ a visual question-answering framework to guide fine-grained semantic alignment, and integrate a post-processing language model to improve decision reliability. Our core innovations lie in decoupling perception from decision-making, enhancing semantic interpretability, and enabling context-adaptive reasoning. Evaluated on two mainstream autonomous driving benchmarks, our approach achieves state-of-the-art performance, significantly improving decision accuracy and cross-scenario generalization while generating human-readable textual justifications for each decision.
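The summary above describes the fusion architecture only at a high level. As an illustrative sketch (not the authors' code), the snippet below assumes a PyTorch-style setup in which pre-extracted ego-view image tokens are fused with text embeddings of the VLM-generated scene description through cross-attention before a small decision head; the module names, dimensions, and four-way action space are all assumptions.

import torch
import torch.nn as nn

class FusionDecisionHead(nn.Module):
    # Illustrative cross-attention fusion of ego-view image tokens with text
    # embeddings of a VLM-generated scene description (names are hypothetical).
    def __init__(self, vis_dim=768, txt_dim=768, hidden=512, num_actions=4):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hidden)    # project image tokens
        self.txt_proj = nn.Linear(txt_dim, hidden)    # project description tokens
        # visual tokens query the linguistic scene context
        self.cross_attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
        self.decision_head = nn.Sequential(
            nn.LayerNorm(hidden),
            nn.Linear(hidden, num_actions),           # e.g. stop / forward / left / right
        )

    def forward(self, vis_tokens, txt_tokens):
        q = self.vis_proj(vis_tokens)                 # (batch, num_visual_tokens, hidden)
        kv = self.txt_proj(txt_tokens)                # (batch, num_text_tokens, hidden)
        fused, _ = self.cross_attn(q, kv, kv)         # fuse vision with language
        return self.decision_head(fused.mean(dim=1))  # pool tokens -> decision logits

In this reading, a frozen VLM produces the scene description offline and a separate text encoder supplies txt_tokens; neither component is specified on this page, so both are assumptions.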

📝 Abstract
Recent advancements in autonomous driving (AD) have explored the use of vision-language models (VLMs) within visual question answering (VQA) frameworks for direct driving decision-making. However, these approaches often depend on handcrafted prompts and suffer from inconsistent performance, limiting their robustness and generalization in real-world scenarios. In this work, we evaluate state-of-the-art open-source VLMs on high-level decision-making tasks using ego-view visual inputs and identify critical limitations in their ability to deliver reliable, context-aware decisions. Motivated by these observations, we propose a new approach that shifts the role of VLMs from direct decision generators to semantic enhancers. Specifically, we leverage their strong general scene understanding to enrich existing vision-based benchmarks with structured, linguistically rich scene descriptions. Building on this enriched representation, we introduce a multi-modal interactive architecture that fuses visual and linguistic features for more accurate decision-making and interpretable textual explanations. Furthermore, we design a post-hoc refinement module that utilizes VLMs to enhance prediction reliability. Extensive experiments on two autonomous driving benchmarks demonstrate that our approach achieves state-of-the-art performance, offering a promising direction for integrating VLMs into reliable and interpretable AD systems.
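The post-hoc refinement module is likewise only named, not specified. Below is a minimal sketch, assuming refinement means asking a language model to sanity-check the predicted decision against the scene description and optionally override it; the prompt wording, the action vocabulary, and the generic llm callable are hypothetical.

ALLOWED_ACTIONS = ["stop", "go straight", "turn left", "turn right"]  # assumed action space

def refine_decision(llm, scene_description: str, predicted: str) -> str:
    # Ask a language model to confirm or correct a proposed driving decision.
    # `llm` is any text-completion callable; the prompt is illustrative only.
    prompt = (
        f"Scene description: {scene_description}\n"
        f"Proposed high-level decision: {predicted}\n"
        f"If the decision is unsafe or inconsistent with the scene, answer with a "
        f"better choice from {ALLOWED_ACTIONS}; otherwise repeat the proposed decision."
    )
    reply = llm(prompt).strip().lower()
    # keep the original prediction unless the model returns a valid alternative
    return reply if reply in ALLOWED_ACTIONS else predicted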
Problem

Research questions and friction points this paper is trying to address.

VLMs lack robustness in autonomous driving decisions
Existing approaches suffer from inconsistent real-world performance
Current methods lack interpretable context-aware decision explanations
Innovation

Methods, ideas, or system contributions that make the work stand out.

VLMs generate structured scene descriptions
Multi-modal fusion of visual and linguistic features
Post-hoc refinement enhances prediction reliability
Authors
Xin Hu
Department of Computer Science, Tulane University
Taotao Jing
Tulane University
Transfer Learning, Deep Learning, Domain Adaptation, Computer Vision
Renran Tian
Department of Industrial and Systems Engineering, North Carolina State University
Zhengming Ding
Assistant Professor of Computer Science, Tulane University
Machine Learning, Computer Vision