🤖 AI Summary
Problem: Large multimodal models (LMMs) used for user interaction in autonomous driving lack fine-grained spatial reasoning, which undermines system interpretability and user trust.
Method: This paper proposes Logic-RAG, a driving-oriented logical retrieval-augmented generation (RAG) framework. Its core component is a dynamic visual-spatial knowledge base that explicitly encodes object-object spatial relations in first-order logic (FOL) and accepts new knowledge in two forms: FOL and natural language. A modular architecture combines a perception module, a query-to-logic embedder, and a logical inference engine, so individual components can be replaced and expert domain knowledge can be added.
Contribution/Results: On visual-spatial queries over synthetic and real-world driving videos, the baseline LMMs (GPT-4V, Claude 3.5) reached only 55% accuracy on synthetic scenes and under 75% on real-world scenes. Augmenting them with Logic-RAG raised accuracy to over 80% and 90%, respectively. In an ablation, the fact-based context alone, without logical inference, improved accuracy by 15%.
📝 Abstract
Large multimodal models (LMMs) are increasingly integrated into autonomous driving systems for user interaction. However, their limitations in fine-grained spatial reasoning pose challenges for system interpretability and user trust. We introduce Logic-RAG, a novel Retrieval-Augmented Generation (RAG) framework that improves LMMs' spatial understanding in driving scenarios. Logic-RAG constructs a dynamic knowledge base (KB) about object-object relationships in first-order logic (FOL) using a perception module, a query-to-logic embedder, and a logical inference engine. We evaluated Logic-RAG on visual-spatial queries using both synthetic and real-world driving videos. When using popular LMMs (GPT-4V, Claude 3.5) as proxies for an autonomous driving system, these models achieved only 55% accuracy on synthetic driving scenes and under 75% on real-world driving scenes. Augmenting them with Logic-RAG increased their accuracies to over 80% and 90%, respectively. An ablation study showed that even without logical inference, the fact-based context constructed by Logic-RAG alone improved accuracy by 15%. Logic-RAG is extensible: it allows seamless replacement of individual components with improved versions and enables domain experts to compose new knowledge in both FOL and natural language. In sum, Logic-RAG addresses critical spatial reasoning deficiencies in LMMs for autonomous driving applications. Code and data are available at https://github.com/Imran2205/LogicRAG.
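To make the idea of a first-order-logic knowledge base with an inference step concrete, here is a minimal Python sketch. It is not the Logic-RAG implementation: the predicate names (`left_of`, `right_of`, `ahead_of`), the object identifiers, the two rules, and the helper functions `infer`, `ask`, and `to_context` are all assumptions made for illustration. In the framework described above, the facts would be produced by a perception module rather than written by hand.

```python
from itertools import product

# Hypothetical ground facts as (predicate, subject, object) triples.
# Hand-written here; a perception module would normally supply them.
facts = {
    ("left_of", "car_1", "ego"),
    ("ahead_of", "pedestrian_1", "car_1"),
    ("ahead_of", "car_1", "truck_1"),
}

def infer(facts):
    """Forward-chain two illustrative rules until no new fact is derived.

    Rule 1 (inverse):    left_of(x, y) -> right_of(y, x)
    Rule 2 (transitive): ahead_of(x, y) and ahead_of(y, z) -> ahead_of(x, z)
    """
    kb = set(facts)
    while True:
        new = set()
        # Rule 1: derive the inverse relation for every left_of fact.
        for pred, x, y in kb:
            if pred == "left_of":
                new.add(("right_of", y, x))
        # Rule 2: chain pairs of ahead_of facts that share a middle object.
        for (p1, x, y1), (p2, y2, z) in product(kb, repeat=2):
            if p1 == p2 == "ahead_of" and y1 == y2 and x != z:
                new.add(("ahead_of", x, z))
        if new <= kb:  # fixed point: nothing new was derived
            return kb
        kb |= new

def ask(kb, pred, x, y):
    """Answer a yes/no spatial query by lookup in the closed knowledge base."""
    return (pred, x, y) in kb

def to_context(kb):
    """Render facts as text lines that could be prepended to an LMM prompt."""
    return "\n".join(f"{p}({x}, {y})" for p, x, y in sorted(kb))

kb = infer(facts)
print(ask(kb, "right_of", "ego", "car_1"))          # derived by Rule 1
print(ask(kb, "ahead_of", "pedestrian_1", "truck_1"))  # derived by Rule 2
print(to_context(kb))
```

The sketch separates the two ways such a knowledge base can help: `ask` answers a query symbolically, while `to_context` only supplies facts as prompt context, which loosely mirrors the ablation setting in which fact-based context is used without logical inference.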