Logic-RAG: Augmenting Large Multimodal Models with Visual-Spatial Knowledge for Road Scene Understanding

📅 2025-03-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Large multimodal models (LMMs) used for user interaction in autonomous driving lack fine-grained spatial reasoning, which undermines explanation fidelity and user trust. Method: This paper proposes Logic-RAG, a driving-oriented logical retrieval-augmented generation (RAG) framework. Its core component is a dynamic visual-spatial knowledge base that explicitly encodes spatial relations between objects in road scenes in first-order logic (FOL) and accepts new knowledge in both FOL and natural language. A modular architecture combines a perception module, a query-to-logic embedder, and a logical inference engine, so individual components can be replaced and domain experts can add knowledge. Contribution/Results: On visual-spatial queries over synthetic and real-world driving videos, the baseline LMMs (GPT-4V, Claude 3.5) reach only 55% accuracy on synthetic scenes and under 75% on real-world scenes; with Logic-RAG these rise to over 80% and 90%, respectively. An ablation shows that the fact-based context alone, without logical inference, improves accuracy by 15%. All gains come from context augmentation, without model retraining.

📝 Abstract

Large multimodal models (LMMs) are increasingly integrated into autonomous driving systems for user interaction. However, their limitations in fine-grained spatial reasoning pose challenges for system interpretability and user trust. We introduce Logic-RAG, a novel Retrieval-Augmented Generation (RAG) framework that improves LMMs' spatial understanding in driving scenarios. Logic-RAG constructs a dynamic knowledge base (KB) about object-object relationships in first-order logic (FOL) using a perception module, a query-to-logic embedder, and a logical inference engine. We evaluated Logic-RAG on visual-spatial queries using both synthetic and real-world driving videos. When using popular LMMs (GPT-4V, Claude 3.5) as proxies for an autonomous driving system, these models achieved only 55% accuracy on synthetic driving scenes and under 75% on real-world driving scenes. Augmenting them with Logic-RAG increased their accuracies to over 80% and 90%, respectively. An ablation study showed that even without logical inference, the fact-based context constructed by Logic-RAG alone improved accuracy by 15%. Logic-RAG is extensible: it allows seamless replacement of individual components with improved versions and enables domain experts to compose new knowledge in both FOL and natural language. In sum, Logic-RAG addresses critical spatial reasoning deficiencies in LMMs for autonomous driving applications. Code and data are available at https://github.com/Imran2205/LogicRAG.
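The pipeline described in the abstract (facts in FOL, a logical inference step, and the result handed to the LMM as context) can be illustrated with a small sketch. This is not the authors' implementation: the predicate names, the object identifiers, the single transitivity rule, and the helpers `forward_chain`, `to_fol`, `to_text`, and `build_context` are all assumptions chosen for illustration.

```python
from itertools import product

# Hypothetical facts a perception module might emit for one frame.
# Each fact is a (predicate, subject, object) triple.
facts = {
    ("left_of", "car_1", "ego"),
    ("left_of", "pedestrian_1", "car_1"),
    ("in_front_of", "truck_1", "ego"),
}


def forward_chain(facts):
    """Apply one assumed rule to a fixed point:
    left_of(x, y) AND left_of(y, z) -> left_of(x, z).
    """
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        left = [f for f in derived if f[0] == "left_of"]
        for (_, x, y1), (_, y2, z) in product(left, left):
            new = ("left_of", x, z)
            if y1 == y2 and x != z and new not in derived:
                derived.add(new)
                changed = True
    return derived


def to_fol(fact):
    """Render a fact as a FOL-style atom, e.g. left_of(car_1, ego)."""
    pred, a, b = fact
    return f"{pred}({a}, {b})"


def to_text(fact):
    """Render the same fact as a plain English sentence."""
    pred, a, b = fact
    return f"{a} is {pred.replace('_', ' ')} {b}."


def build_context(facts):
    """Build a context string that pairs each atom with its English form."""
    return "\n".join(sorted(f"{to_fol(f)}: {to_text(f)}" for f in facts))


kb = forward_chain(facts)
context = build_context(kb)
print(context)
```

The resulting `context` string would be prepended to the user's question before it is sent to the LMM. Dropping the `forward_chain` call and passing only the raw facts corresponds loosely to the fact-only setting mentioned in the ablation.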

Problem

Research questions and friction points this paper is trying to address.

LMMs integrated into autonomous driving systems have limited fine-grained spatial reasoning

Baseline LMMs answer visual-spatial queries about driving scenes with low accuracy

These limitations pose challenges for system interpretability and user trust

Innovation

Methods, ideas, or system contributions that make the work stand out.

Logic-RAG enhances LMMs with visual-spatial knowledge.

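Builds a dynamic knowledge base (KB) of object-object relationships in first-order logic (FOL).

Raises LMM accuracy from 55% to over 80% on synthetic driving scenes and from under 75% to over 90% on real-world scenes.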
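To make the idea of a dynamic KB concrete, the following minimal sketch rebuilds FOL-style facts for each video frame from detected bounding boxes. The `Box` type, the `left_of` test based on horizontal extents, and the per-frame dictionary are assumptions for illustration only; the source does not specify how the perception module derives relations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical detection: object name and horizontal extent in pixels."""
    name: str
    x_min: float
    x_max: float


def spatial_facts(boxes):
    """Emit left_of(a, b) when a lies entirely to the left of b (assumed rule)."""
    facts = []
    for a in boxes:
        for b in boxes:
            if a is not b and a.x_max < b.x_min:
                facts.append(f"left_of({a.name}, {b.name})")
    return sorted(facts)


def update_kb(kb, frame_id, boxes):
    """Replace the facts stored for one frame, keeping the KB in sync with the video."""
    kb[frame_id] = spatial_facts(boxes)
    return kb


kb = {}
update_kb(kb, 0, [Box("car_1", 10, 50), Box("bus_1", 80, 160)])
# In the next frame the car has overtaken the bus, so the relation flips.
update_kb(kb, 1, [Box("car_1", 170, 210), Box("bus_1", 80, 160)])
print(kb)
```

Keying facts by frame is one simple way to let answers change as the scene changes; a fuller system would likely also track objects across frames and handle depth, which this sketch ignores.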
