RAC3: Retrieval-Augmented Corner Case Comprehension for Autonomous Driving with Vision-Language Models

📅 2024-12-15

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

To address safety risks arising from insufficient understanding of long-tail hazardous scenarios (corner cases) in autonomous driving, this paper proposes RAC3, a retrieval-augmented comprehension framework for vision-language models (VLMs). It combines retrieval-augmented generation (RAG), which dynamically injects context-specific external knowledge, with contrastive fine-tuning that embeds image-text pairs into a shared semantic space so that similar scenarios can be retrieved robustly. On a curated corner case dataset, the framework improves the agreement between LLaVA-v1.6-34B's generated text and the reference text: cosine similarity increases by 5.22%, ROUGE-L F1 by 39.91%, Precision by 55.80%, and Recall by 13.74%. The authors report that the approach mitigates hallucination and improves semantic alignment.
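As a rough illustration of the ROUGE-L metric quoted in the summary, the sketch below computes precision, recall, and F1 from the longest common subsequence (LCS) of two token lists. The whitespace tokenization, lowercasing, and the function names `lcs_length` and `rouge_l` are assumptions made for this example; the paper's actual evaluation tooling is not described on this page.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L precision, recall, and F1 (assumed: whitespace tokens, beta = 1)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    lcs = lcs_length(cand, ref)
    precision = lcs / len(cand)  # share of generated tokens that lie on the LCS
    recall = lcs / len(ref)      # share of reference tokens that lie on the LCS
    f1 = 0.0 if lcs == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge_l(
    "a pedestrian crosses the road at night",
    "a pedestrian is crossing the road",
)
print(scores)
```

In this toy pair the LCS is "a pedestrian the road" (4 tokens), so precision is 4/7 and recall is 4/6. A verbose answer that adds unsupported detail lowers precision, which is one plausible reading of why a large Precision gain is reported alongside hallucination mitigation.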

📝 Abstract

Understanding and addressing corner cases is essential for ensuring the safety and reliability of autonomous driving systems. Vision-Language Models (VLMs) play a crucial role in enhancing scenario comprehension, yet they face significant challenges, such as hallucination and insufficient real-world grounding, which compromise their performance in critical driving scenarios. In this work, we propose RAC3, a novel framework designed to improve VLMs' ability to handle corner cases effectively. The framework integrates Retrieval-Augmented Generation (RAG) to mitigate hallucination by dynamically incorporating context-specific external knowledge. A cornerstone of RAC3 is its cross-modal alignment fine-tuning, which utilizes contrastive learning to embed image-text pairs into a unified semantic space, enabling robust retrieval of similar scenarios. We evaluate RAC3 through extensive experiments using a curated dataset of corner case scenarios, demonstrating its ability to enhance semantic alignment, improve hallucination mitigation, and achieve superior performance metrics, such as Cosine Similarity and ROUGE-L scores. For example, for the LLaVA-v1.6-34B VLM, the cosine similarity between the generated text and the reference text has increased by 5.22%. The F1-score in ROUGE-L has increased by 39.91%, the Precision has increased by 55.80%, and the Recall has increased by 13.74%. This work underscores the potential of retrieval-augmented VLMs to advance the robustness and safety of autonomous driving in complex environments.
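The cross-modal alignment step described in the abstract can be pictured with a CLIP-style symmetric contrastive (InfoNCE) loss. This is a minimal sketch under stated assumptions, not the authors' training code: it uses in-batch negatives only, a hypothetical temperature of 0.07, plain Python lists of non-zero vectors in place of real encoder outputs, and the illustrative function name `symmetric_info_nce`.

```python
import math


def normalize(v):
    """Scale a non-zero vector to unit length so dot products are cosines."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_row(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]


def symmetric_info_nce(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image InfoNCE over a batch.

    Pair k (image k, text k) is the positive; every other item in the
    batch acts as a negative. The temperature value is an assumption.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    sim = [
        [sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
        for i in imgs
    ]
    loss_i2t = sum(cross_entropy_row(sim[k], k) for k in range(n)) / n
    loss_t2i = sum(
        cross_entropy_row([sim[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (loss_i2t + loss_t2i)


aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

The loss is small when each image embedding sits closest to its own caption and large when the pairing is scrambled, which is the property that makes nearest-neighbor retrieval in the shared space meaningful.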

Problem

Research questions and friction points this paper is trying to address.

Enhance corner case comprehension in autonomous driving using VLMs

Address VLMs' hallucination and insufficient real-world grounding issues

Improve safety and interpretability of autonomous driving systems

Innovation

Methods, ideas, or system contributions that make the work stand out.

Frequency-spatial fusion image encoder

Cross-modal alignment with negative mining

KMeans and HNSW fast querying pipeline
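The querying pipeline named in the last item can be sketched as a two-stage lookup: route the query to its nearest cluster, then rank only that cluster's members. This is a simplified stand-in, not the paper's implementation. The centroids are assumed to come from a prior KMeans run, an exhaustive cosine ranking inside the cluster replaces the HNSW graph search, and all names and vectors (`build_index`, `retrieve`, the scenario labels) are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def nearest_centroid(vec, centroids):
    """Index of the centroid most similar to vec."""
    return max(range(len(centroids)), key=lambda c: cosine(vec, centroids[c]))


def build_index(embeddings, centroids):
    """Group item ids by their nearest centroid (the coarse stage)."""
    buckets = {c: [] for c in range(len(centroids))}
    for item_id, vec in embeddings.items():
        buckets[nearest_centroid(vec, centroids)].append(item_id)
    return buckets


def retrieve(query, embeddings, centroids, buckets, k=2):
    """Rank only the items in the query's cluster (the fine stage).

    A real system would run an approximate search such as HNSW here;
    exhaustive ranking is used to keep the sketch dependency-free.
    """
    bucket = buckets[nearest_centroid(query, centroids)]
    ranked = sorted(bucket, key=lambda i: cosine(query, embeddings[i]), reverse=True)
    return ranked[:k]


# Toy 2-D embeddings; real embeddings would come from the fine-tuned encoders.
centroids = [[1.0, 0.0], [0.0, 1.0]]
embeddings = {
    "fallen_tree": [0.9, 0.1],
    "road_debris": [0.8, 0.3],
    "animal_crossing": [0.1, 0.9],
    "flooded_lane": [0.2, 0.8],
}
buckets = build_index(embeddings, centroids)
print(retrieve([1.0, 0.05], embeddings, centroids, buckets, k=2))
```

Restricting the search to one cluster trades a little recall for speed, since items near a cluster boundary can be missed. Probing several nearby clusters is a common mitigation.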
