๐ค AI Summary
To address safety risks arising from insufficient understanding of long-tail hazardous scenarios (corner cases) in autonomous driving, this paper proposes a retrieval-augmented vision-language model (VLM) comprehension framework. Methodologically, it integrates retrieval-augmented generation (RAG) with contrastive learningโdriven fine-tuning of image-text joint embeddings, enabling dynamic external knowledge injection and fine-grained cross-modal alignment. The framework performs end-to-end optimization of LLaVA-v1.6-34B on a custom-built corner case dataset. Its core contribution is the first retrieval-augmented comprehension paradigm specifically designed for corner cases, significantly improving semantic accuracy and real-world grounding: Cosine Similarity increases by 5.22%, ROUGE-L F1 by 39.91%, and Precision by 55.80%. The approach effectively mitigates hallucination and enhances generation consistency.
๐ Abstract
Understanding and addressing corner cases is essential for ensuring the safety and reliability of autonomous driving systems. Vision-Language Models (VLMs) play a crucial role in enhancing scenario comprehension, yet they face significant challenges, such as hallucination and insufficient real-world grounding, which compromise their performance in critical driving scenarios. In this work, we propose RAC3, a novel framework designed to improve VLMs' ability to handle corner cases effectively. The framework integrates Retrieval-Augmented Generation (RAG) to mitigate hallucination by dynamically incorporating context-specific external knowledge. A cornerstone of RAC3 is its cross-modal alignment fine-tuning, which utilizes contrastive learning to embed image-text pairs into a unified semantic space, enabling robust retrieval of similar scenarios. We evaluate RAC3 through extensive experiments using a curated dataset of corner case scenarios, demonstrating its ability to enhance semantic alignment, improve hallucination mitigation, and achieve superior performance metrics, such as Cosine Similarity and ROUGE-L scores. For example, for the LLaVA-v1.6-34B VLM, the cosine similarity between the generated text and the reference text has increased by 5.22%. The F1-score in ROUGE-L has increased by 39.91%, the Precision has increased by 55.80%, and the Recall has increased by 13.74%. This work underscores the potential of retrieval-augmented VLMs to advance the robustness and safety of autonomous driving in complex environments.