🤖 AI Summary
Remote sensing text-to-image retrieval faces two key challenges: weak interpretability and difficulty modeling complex spatial relationships. This paper proposes RUNE, an explicit neural-symbolic reasoning framework that departs from the implicit joint-embedding paradigm. RUNE leverages large language models (LLMs) to translate natural language queries into first-order logic (FOL) expressions and performs interpretable symbolic reasoning over the entities produced by an object detector. It introduces the first remote sensing "text-to-logic" translation paradigm; designs a logic decomposition strategy that improves reasoning scalability; and proposes two novel evaluation metrics, Retrieval Robustness to Query Complexity (RRQC) and Retrieval Robustness to Image Uncertainty (RRIU). On an enhanced DOTA benchmark, RUNE significantly outperforms state-of-the-art RS-LVLMs, with substantial gains in both accuracy on complex queries and interpretability. Its robustness and practical utility are further validated on a real-world post-disaster satellite image retrieval task.
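To make the explicit-reasoning idea concrete, here is a minimal sketch: a natural language query is translated into a FOL-style expression (in RUNE this step is done by an LLM; a hand-written formula stands in below), which is then checked against the entities an object detector found in each image. The `Entity` class, the `fol_query` function, and the sample data are all illustrative assumptions, not the paper's actual API.

```python
from dataclasses import dataclass

@dataclass
class Entity:
    cls: str   # detected class, e.g. "plane", "harbor" (DOTA-style categories)
    x: float   # entity center coordinates in image space
    y: float

# Hypothetical FOL expression for the query
# "images with at least two planes west of a harbor":
#   exists h (harbor(h) AND |{p : plane(p) AND west_of(p, h)}| >= 2)
# In RUNE an LLM would produce this expression; here it is hand-written.
def fol_query(entities: list[Entity]) -> bool:
    harbors = [e for e in entities if e.cls == "harbor"]
    planes = [e for e in entities if e.cls == "plane"]
    return any(sum(p.x < h.x for p in planes) >= 2 for h in harbors)

# Retrieval = keep the images whose detected entities satisfy the formula.
images = {
    "img_001": [Entity("plane", 10, 5), Entity("plane", 12, 8), Entity("harbor", 40, 6)],
    "img_002": [Entity("plane", 50, 5), Entity("harbor", 40, 6)],
}
print([name for name, ents in images.items() if fol_query(ents)])  # ['img_001']
```

Because retrieval reduces to evaluating a symbolic formula, every match comes with an explanation: the bindings (here, which harbor and which planes) that satisfied the expression.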
📝 Abstract
Text-to-image retrieval in remote sensing (RS) has advanced rapidly with the rise of large vision-language models (LVLMs) tailored for aerial and satellite imagery, culminating in remote sensing large vision-language models (RS-LVLMs). However, limited explainability and poor handling of complex spatial relations remain key challenges for real-world use. To address these issues, we introduce RUNE (Reasoning Using Neurosymbolic Entities), an approach that combines Large Language Models (LLMs) with neurosymbolic AI to retrieve images by reasoning over the compatibility between detected entities and First-Order Logic (FOL) expressions derived from text queries. Unlike RS-LVLMs that rely on implicit joint embeddings, RUNE performs explicit reasoning, enhancing both performance and interpretability. For scalability, we propose a logic decomposition strategy that operates on conditioned subsets of detected entities, guaranteeing shorter execution time than neural approaches. Rather than using foundation models for end-to-end retrieval, we leverage them only to generate FOL expressions, delegating reasoning to a neurosymbolic inference module. For evaluation, we repurpose the DOTA dataset, originally designed for object detection, by augmenting it with more complex queries than those in existing benchmarks. We show the LLM's effectiveness in text-to-logic translation and compare RUNE with state-of-the-art RS-LVLMs, demonstrating superior performance. We introduce two metrics, Retrieval Robustness to Query Complexity (RRQC) and Retrieval Robustness to Image Uncertainty (RRIU), which evaluate performance relative to query complexity and image uncertainty. RUNE outperforms joint-embedding models in complex RS retrieval tasks, offering gains in performance, robustness, and explainability. We show RUNE's potential for real-world RS applications through a use case on post-flood satellite image retrieval.
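The scalability claim rests on the logic decomposition strategy. The sketch below illustrates one plausible reading of "operating on conditioned subsets": a conjunctive FOL expression is split into class-conditioned atoms, each atom is evaluated only on the subset of detected entities of the class it mentions, and evaluation short-circuits as soon as one atom fails. The data structures and the `decompose_and_evaluate` helper are assumptions for illustration; the paper's actual decomposition algorithm may differ.

```python
from collections import defaultdict

def decompose_and_evaluate(atoms, entities):
    """Evaluate a conjunction of class-conditioned atoms with short-circuiting.

    atoms    : list of (cls, predicate) pairs, where each predicate only
               needs the entities of class `cls`
    entities : list of (cls, features) tuples from the object detector
    """
    # Condition the entity set: group detections by class once up front.
    by_class = defaultdict(list)
    for cls, feat in entities:
        by_class[cls].append(feat)
    for cls, predicate in atoms:
        # Each atom touches only its own class subset; if the subset
        # fails, the whole conjunction fails and evaluation stops early.
        if not predicate(by_class[cls]):
            return False
    return True

# Query: "images with at least three ships and at least one bridge"
atoms = [
    ("ship",   lambda xs: len(xs) >= 3),
    ("bridge", lambda xs: len(xs) >= 1),
]
entities = [("ship", {}), ("ship", {}), ("ship", {}), ("bridge", {})]
print(decompose_and_evaluate(atoms, entities))  # True
```

Under this reading, the cost of evaluating a query grows with the sizes of the relevant class subsets rather than with the full detection set, and failing atoms prune work immediately, which is consistent with the abstract's execution-time claim.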