🤖 AI Summary
Current vision-language models (VLMs) suffer from heavy reliance on handcrafted prompts, poor robustness, and limited contextual awareness when deployed for high-level autonomous driving decision-making. To address these limitations, this work reframes VLMs not as direct decision-makers but as semantic enhancers. We propose a multimodal interaction architecture that fuses ego-view visual features with structured scene descriptions, employ a visual question-answering framework to guide fine-grained semantic alignment, and integrate a post-processing language model to improve decision reliability. Our core innovations lie in decoupling perception from decision-making, enhancing semantic interpretability, and enabling context-adaptive reasoning. Evaluated on two mainstream autonomous driving benchmarks, our approach achieves state-of-the-art performance, significantly improving decision accuracy and cross-scenario generalization while generating human-readable textual justifications for each decision.
📝 Abstract
Recent advancements in autonomous driving (AD) have explored the use of vision-language models (VLMs) within visual question answering (VQA) frameworks for direct driving decision-making. However, these approaches often depend on handcrafted prompts and suffer from inconsistent performance, limiting their robustness and generalization in real-world scenarios. In this work, we evaluate state-of-the-art open-source VLMs on high-level decision-making tasks using ego-view visual inputs and identify critical limitations in their ability to deliver reliable, context-aware decisions. Motivated by these observations, we propose a new approach that shifts the role of VLMs from direct decision generators to semantic enhancers. Specifically, we leverage their strong general scene understanding to enrich existing vision-based benchmarks with structured, linguistically rich scene descriptions. Building on this enriched representation, we introduce a multimodal interactive architecture that fuses visual and linguistic features for more accurate decision-making and interpretable textual explanations. Furthermore, we design a post-hoc refinement module that uses VLMs to enhance prediction reliability. Extensive experiments on two autonomous driving benchmarks demonstrate that our approach achieves state-of-the-art performance, offering a promising direction for integrating VLMs into reliable and interpretable AD systems.