RAC3: Retrieval-Augmented Corner Case Comprehension for Autonomous Driving with Vision-Language Models

📅 2024-12-15
🏛️ arXiv.org
📈 Citations: 1
✨ Influential: 0
🤖 AI Summary
To address safety risks arising from insufficient understanding of long-tail hazardous scenarios (corner cases) in autonomous driving, this paper proposes a retrieval-augmented vision-language model (VLM) comprehension framework. Methodologically, it integrates retrieval-augmented generation (RAG) with contrastive-learning-driven fine-tuning of image-text joint embeddings, enabling dynamic external knowledge injection and fine-grained cross-modal alignment. The framework performs end-to-end optimization of LLaVA-v1.6-34B on a custom-built corner case dataset. Its core contribution is the first retrieval-augmented comprehension paradigm specifically designed for corner cases, significantly improving semantic accuracy and real-world grounding: Cosine Similarity increases by 5.22%, ROUGE-L F1 by 39.91%, and Precision by 55.80%. The approach effectively mitigates hallucination and enhances generation consistency.
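For context on the reported metrics, here is a minimal sketch of how cosine similarity (over precomputed text embeddings) and ROUGE-L F1 are conventionally computed. This is illustrative only, not the paper's evaluation code; the function names are assumptions.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def rouge_l_f1(candidate, reference):
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    c, r = candidate.split(), reference.split()
    # Longest common subsequence length via dynamic programming.
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, cw in enumerate(c, 1):
        for j, rw in enumerate(r, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if cw == rw else max(dp[i-1][j], dp[i][j-1])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)  # precision vs. candidate, recall vs. reference
    return 2 * prec * rec / (prec + rec)
```

A percentage gain such as "+39.91% ROUGE-L F1" then refers to the relative change of this score between the baseline and the fine-tuned model.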

๐Ÿ“ Abstract
Understanding and addressing corner cases is essential for ensuring the safety and reliability of autonomous driving systems. Vision-Language Models (VLMs) play a crucial role in enhancing scenario comprehension, yet they face significant challenges, such as hallucination and insufficient real-world grounding, which compromise their performance in critical driving scenarios. In this work, we propose RAC3, a novel framework designed to improve VLMs' ability to handle corner cases effectively. The framework integrates Retrieval-Augmented Generation (RAG) to mitigate hallucination by dynamically incorporating context-specific external knowledge. A cornerstone of RAC3 is its cross-modal alignment fine-tuning, which utilizes contrastive learning to embed image-text pairs into a unified semantic space, enabling robust retrieval of similar scenarios. We evaluate RAC3 through extensive experiments using a curated dataset of corner case scenarios, demonstrating its ability to enhance semantic alignment, improve hallucination mitigation, and achieve superior performance metrics, such as Cosine Similarity and ROUGE-L scores. For example, for the LLaVA-v1.6-34B VLM, the cosine similarity between the generated text and the reference text has increased by 5.22%. The F1-score in ROUGE-L has increased by 39.91%, the Precision has increased by 55.80%, and the Recall has increased by 13.74%. This work underscores the potential of retrieval-augmented VLMs to advance the robustness and safety of autonomous driving in complex environments.
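The cross-modal alignment described in the abstract embeds image-text pairs into a unified semantic space via contrastive learning, with in-batch non-matching pairs acting as negatives. A minimal NumPy sketch of a CLIP-style symmetric InfoNCE objective, assuming precomputed encoder outputs (the function names and temperature value are illustrative, not taken from the paper):

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere so dot products are cosines."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def symmetric_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """CLIP-style InfoNCE loss over a batch of paired image/text embeddings.

    Matching pairs (row i of each matrix) are pulled together; every other
    in-batch pairing serves as a negative.
    """
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature      # (B, B) cosine-similarity matrix
    labels = np.arange(len(logits))         # the matching pair sits on the diagonal

    def ce(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Cross-entropy in both directions: image->text and text->image.
    return 0.5 * (ce(logits) + ce(logits.T))
```

Minimizing this loss makes a scene image and its description retrievable from each other by nearest-neighbor search in the shared space, which is what the retrieval stage relies on.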
Problem

Research questions and friction points this paper is trying to address.

Enhance corner case comprehension in autonomous driving using VLMs
Address VLMs' hallucination and insufficient real-world grounding issues
Improve safety and interpretability of autonomous driving systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Frequency-spatial fusion image encoder
Cross-modal alignment with negative mining
KMeans and HNSW fast querying pipeline
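The KMeans + HNSW querying pipeline listed above amounts to coarse-to-fine retrieval: cluster the embedding database, route a query to its nearest centroid, then run a fast nearest-neighbor search inside that bucket. A hedged NumPy sketch under stated assumptions: it substitutes exact in-bucket search where the paper would use an HNSW index (e.g. via a library such as hnswlib), and all names are illustrative.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns centroids and per-point assignments."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Distance of every point to every centroid, shape (n, k).
        d = np.linalg.norm(x[:, None] - centroids[None], axis=-1)
        assign = d.argmin(axis=1)
        for j in range(k):
            if (assign == j).any():
                centroids[j] = x[assign == j].mean(axis=0)
    return centroids, assign

def coarse_to_fine_search(query, db, centroids, assign, topk=3):
    """Route the query to its nearest cluster, then search only that bucket.

    In a production pipeline the in-bucket step would be an HNSW index
    rather than the exact scan used here.
    """
    c = np.linalg.norm(centroids - query, axis=1).argmin()
    bucket = np.flatnonzero(assign == c)
    d = np.linalg.norm(db[bucket] - query, axis=1)
    return bucket[np.argsort(d)[:topk]]    # global indices of nearest items
```

The payoff is that each query touches only one cluster's worth of candidates instead of the full database, which is what makes retrieval fast enough to run per scenario.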
Authors

Yujin Wang (Ph.D. Student, Tongji University)
Quanfeng Liu
Jiaqi Fan (Tongji University; intelligent transportation systems)
Jinlong Hong (School of Automotive Studies, Tongji University, Shanghai 201804, China)
Hongqing Chu (School of Automotive Studies, Tongji University, Shanghai 201804, China)
Mengjian Tian (College of Urban Transportation and Logistics, Shenzhen Technology University, Shenzhen 518118, China)
Bingzhao Gao (Professor, School of Automotive Studies, Tongji University)
Hong Chen (College of Electronic and Information Engineering, Tongji University, Shanghai 201804, China)