AI Summary
Traditional autonomous driving systems struggle to make reliable decisions in complex, unfamiliar environments because of their limited capacity to understand and reason about spatial relationships. This paper proposes a driving assistance system based on a vision-enhanced large language model (LLM). Methodologically, it integrates YOLOv4 and a Vision Transformer into a multi-granularity visual adapter and couples the adapter with GPT-4 to form a spatial reasoning module, jointly supporting situation awareness, natural-language description, and trustworthy decision-making. Using multimodal feature alignment and a trust-aware evaluation protocol, empirical testing with 45 experienced drivers shows that the system achieves near-human accuracy in situational description and moderate agreement with human decisions (Cohen's κ = 0.58). The approach advances semantic comprehension and interpretable, reasoning-based decision support in complex driving scenarios.
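The summary does not specify how object-level and scene-level features are combined. The sketch below illustrates one plausible reading of the multi-granularity visual adapter: per-object features (as a YOLOv4 detector might provide) and a global ViT scene embedding are projected into a shared space and fused into token sequences an LLM could consume. All dimensions, the `VisualAdapter` name, and the fusion-by-self-attention design are assumptions for illustration, not the paper's published architecture.

```python
# Minimal sketch of a multi-granularity visual adapter (assumed design:
# dimensions, concatenation-based fusion, and the projection into an
# LLM embedding space are illustrative, not the paper's specification).
import torch
import torch.nn as nn

class VisualAdapter(nn.Module):
    """Fuses object-level (detector) and scene-level (ViT) features."""
    def __init__(self, det_dim=256, vit_dim=768, llm_dim=4096):
        super().__init__()
        self.det_proj = nn.Linear(det_dim, llm_dim)   # per-object box features
        self.vit_proj = nn.Linear(vit_dim, llm_dim)   # global scene feature
        self.fuse = nn.TransformerEncoderLayer(
            d_model=llm_dim, nhead=8, batch_first=True
        )

    def forward(self, det_feats, vit_cls):
        # det_feats: (B, num_objects, det_dim) pooled detector features
        # vit_cls:   (B, vit_dim) ViT [CLS] embedding of the full frame
        tokens = torch.cat(
            [self.vit_proj(vit_cls).unsqueeze(1), self.det_proj(det_feats)],
            dim=1,
        )
        # Self-attention lets object tokens and the scene token interact
        # before being handed to the LLM as soft prompt tokens.
        return self.fuse(tokens)  # (B, 1 + num_objects, llm_dim)

adapter = VisualAdapter()
det_feats = torch.randn(1, 10, 256)  # 10 detected objects (stand-in values)
vit_cls = torch.randn(1, 768)
print(adapter(det_feats, vit_cls).shape)  # torch.Size([1, 11, 4096])
```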
Abstract
Traditional autonomous driving systems often struggle with reasoning in complex, unexpected scenarios due to limited comprehension of spatial relationships. In response, this study introduces a Large Language Model (LLM)-based Autonomous Driving (AD) assistance system that integrates a vision adapter and an LLM reasoning module to enhance visual understanding and decision-making. The vision adapter, combining YOLOv4 and Vision Transformer (ViT), extracts comprehensive visual features, while GPT-4 enables human-like spatial reasoning and response generation. Experimental evaluations with 45 experienced drivers showed that the system closely mirrors human performance in describing situations and aligns moderately with human decisions when generating appropriate responses.
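For readers unfamiliar with the agreement statistic quoted above (Cohen's κ = 0.58), the snippet below shows how such a score is typically computed between system decisions and human reference decisions, here via scikit-learn's `cohen_kappa_score`. The label set and the toy responses are invented for demonstration and are not the study's data.

```python
# Illustrative computation of Cohen's kappa between system and human
# decisions. Labels and values below are made up; only the metric and
# the sklearn API are real.
from sklearn.metrics import cohen_kappa_score

human_decisions  = ["brake", "brake", "yield", "proceed", "brake", "yield"]
system_decisions = ["brake", "yield", "yield", "proceed", "brake", "proceed"]

kappa = cohen_kappa_score(human_decisions, system_decisions)
# By the common Landis & Koch convention, 0.41-0.60 counts as "moderate"
# agreement, which is the band the reported kappa of 0.58 falls into.
print(f"Cohen's kappa = {kappa:.2f}")
```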