RAD: Retrieval-Augmented Decision-Making of Meta-Actions with Vision-Language Models in Autonomous Driving

📅 2025-03-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address unreliable high-level meta-action decision-making in autonomous driving, caused by the weak spatial perception and hallucination of vision-language models (VLMs), this paper proposes RAD, a retrieval-augmented decision-making framework. RAD adapts the retrieval-augmented generation (RAG) paradigm to meta-action reasoning through a three-stage pipeline (embedding, retrieval, and generation), and the underlying VLMs are fine-tuned on a dataset curated from NuScenes to strengthen bird's-eye view (BEV) understanding and spatial relation modeling. By grounding VLM reasoning in retrieved, spatially consistent evidence, RAD mitigates spatial misjudgments and semantic hallucinations. On a customized NuScenes test set, RAD improves over baselines in meta-action matching accuracy (+8.2%), F1 score (+7.6%), and overall composite score (+9.1%), demonstrating gains in both reliability and robustness for autonomous driving decision-making.

📝 Abstract
Accurately understanding and deciding high-level meta-actions is essential for ensuring reliable and safe autonomous driving systems. While vision-language models (VLMs) have shown significant potential in various autonomous driving tasks, they often suffer from limitations such as inadequate spatial perception and hallucination, reducing their effectiveness in complex autonomous driving scenarios. To address these challenges, we propose a retrieval-augmented decision-making (RAD) framework, a novel architecture designed to enhance VLMs' capabilities to reliably generate meta-actions in autonomous driving scenes. RAD leverages a retrieval-augmented generation (RAG) pipeline to dynamically improve decision accuracy through a three-stage process consisting of the embedding flow, retrieving flow, and generating flow. Additionally, we fine-tune VLMs on a specifically curated dataset derived from the NuScenes dataset to enhance their spatial perception and bird's-eye view image comprehension capabilities. Extensive experimental evaluations on the curated NuScenes-based dataset demonstrate that RAD outperforms baseline methods across key evaluation metrics, including match accuracy, F1 score, and a self-defined overall score, highlighting its effectiveness in improving meta-action decision-making for autonomous driving tasks.
Problem

Research questions and friction points this paper is trying to address.

Enhance meta-action decision-making in autonomous driving
Address limitations of vision-language models in spatial perception
Improve decision accuracy using retrieval-augmented generation pipeline
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-augmented generation pipeline enhances decision accuracy
Fine-tuned VLMs improve spatial perception and image comprehension
Three-stage process: embedding, retrieving, and generating flows
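The embedding, retrieving, and generating flows listed above follow the generic retrieve-then-generate pattern, which can be sketched in miniature. The snippet below is an illustrative toy, not the paper's implementation: the character-hash `embed` function, the in-memory `memory` list of scene/meta-action pairs, and the prompt template are hypothetical stand-ins for a learned scene encoder, a vector store of annotated NuScenes examples, and a fine-tuned VLM.

```python
import math

def embed(scene):
    """Embedding flow: map a scene description to a fixed-size vector.
    Toy character-hash embedding; a real system would use a learned encoder."""
    vec = [0.0] * 8
    for i, ch in enumerate(scene):
        vec[i % 8] += ord(ch)
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

def retrieve(query_vec, database, k=2):
    """Retrieving flow: rank stored scenes by dot-product similarity
    (cosine similarity, since vectors are unit-normalized) and keep top-k."""
    def score(entry):
        return sum(q * v for q, v in zip(query_vec, entry["vec"]))
    return sorted(database, key=score, reverse=True)[:k]

def build_prompt(scene, examples):
    """Generating flow: ground the downstream (hypothetical) VLM call
    in the retrieved evidence by prepending it to the prompt."""
    context = "\n".join(f"- {e['scene']} -> {e['meta_action']}" for e in examples)
    return f"Retrieved examples:\n{context}\nCurrent scene: {scene}\nMeta-action:"

# Hypothetical memory of annotated driving scenes and their meta-actions.
memory = [
    {"scene": "pedestrian crossing ahead", "meta_action": "decelerate"},
    {"scene": "clear highway, slow truck in lane", "meta_action": "change lane left"},
    {"scene": "red traffic light ahead", "meta_action": "stop"},
]
for entry in memory:
    entry["vec"] = embed(entry["scene"])

query = "pedestrian stepping onto crossing"
top = retrieve(embed(query), memory, k=2)
prompt = build_prompt(query, top)
```

The key design point the sketch shows is that the generation step never sees the raw memory: it sees only the top-k retrieved entries, so the VLM's meta-action decision is conditioned on spatially relevant precedents rather than on its parametric knowledge alone.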