RS-RAG: Bridging Remote Sensing Imagery and Comprehensive Knowledge with a Multi-Modal Dataset and Retrieval-Augmented Generation Model

📅 2025-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current remote sensing vision-language models (VLMs) are constrained by closed-world assumptions and a lack of external knowledge, which limits their capacity for complex semantic reasoning that requires domain- or world-level knowledge. To address this, we propose RS-RAG, a novel retrieval-augmented generation framework tailored for remote sensing. We introduce RSWK, a multimodal dataset integrating global landmark knowledge, covering 14,141 landmarks from 175 countries with aligned image-text pairs. Our method comprises: (1) construction of a multimodal vector database; (2) cross-modal retrieval and re-ranking; (3) joint embedding of high-resolution imagery and textual knowledge; (4) VLM fine-tuning; and (5) knowledge-aware prompt engineering. Experiments on image captioning, image classification, and visual question answering show that RS-RAG significantly outperforms state-of-the-art baselines, indicating that injecting external knowledge enhances semantic reasoning in remote sensing.

📝 Abstract
Recent progress in VLMs has demonstrated impressive capabilities across a variety of tasks in the natural image domain. Motivated by these advancements, the remote sensing community has begun to adopt VLMs for remote sensing vision-language tasks, including scene understanding, image captioning, and visual question answering. However, existing remote sensing VLMs typically rely on closed-set scene understanding and focus on generic scene descriptions, and they lack the ability to incorporate external knowledge. This limitation hinders their capacity for semantic reasoning over complex or context-dependent queries that involve domain-specific or world knowledge. To address these challenges, we first introduce a multimodal Remote Sensing World Knowledge (RSWK) dataset, which comprises high-resolution satellite imagery and detailed textual descriptions for 14,141 well-known landmarks from 175 countries, integrating both remote sensing domain knowledge and broader world knowledge. Building upon this dataset, we propose a novel Remote Sensing Retrieval-Augmented Generation (RS-RAG) framework, which consists of two key components. The Multi-Modal Knowledge Vector Database Construction module encodes remote sensing imagery and associated textual knowledge into a unified vector space. The Knowledge Retrieval and Response Generation module retrieves and re-ranks relevant knowledge based on image and/or text queries, and incorporates the retrieved content into a knowledge-augmented prompt to guide the VLM in producing contextually grounded responses. We validate the effectiveness of our approach on three representative vision-language tasks (image captioning, image classification, and visual question answering), where RS-RAG significantly outperforms state-of-the-art baselines.
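
The retrieve-then-re-rank flow described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `KnowledgeEntry` record, the `retrieve` and `rerank` functions, the hand-written 3-dimensional toy vectors, and the weighted-sum fusion with `alpha` are hypothetical and are not taken from the paper, which does not specify its encoders or re-ranking rule here. In practice the vectors would come from image and text encoders that share one embedding space.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class KnowledgeEntry:
    """One hypothetical database record: a landmark with paired embeddings."""
    name: str
    text: str
    image_vec: List[float]  # embedding of the landmark's satellite image
    text_vec: List[float]   # embedding of the landmark's textual description


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(db: List[KnowledgeEntry], query_image_vec: List[float],
             top_k: int = 2) -> List[KnowledgeEntry]:
    """First stage: keep the top_k entries by image-to-image similarity."""
    ranked = sorted(db, key=lambda e: cosine(query_image_vec, e.image_vec),
                    reverse=True)
    return ranked[:top_k]


def rerank(candidates: List[KnowledgeEntry], query_image_vec: List[float],
           query_text_vec: List[float], alpha: float = 0.5) -> List[KnowledgeEntry]:
    """Second stage: re-order candidates by a weighted sum of image and text similarity."""
    def fused(e: KnowledgeEntry) -> float:
        return (alpha * cosine(query_image_vec, e.image_vec)
                + (1 - alpha) * cosine(query_text_vec, e.text_vec))
    return sorted(candidates, key=fused, reverse=True)


# Toy database with placeholder entries (not real RSWK data).
db = [
    KnowledgeEntry("Landmark A", "placeholder description A", [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),
    KnowledgeEntry("Landmark B", "placeholder description B", [0.9, 0.1, 0.0], [1.0, 0.0, 0.0]),
    KnowledgeEntry("Landmark C", "placeholder description C", [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]),
]
query_image_vec = [1.0, 0.0, 0.0]
query_text_vec = [1.0, 0.0, 0.0]

candidates = retrieve(db, query_image_vec, top_k=2)
ranked = rerank(candidates, query_image_vec, query_text_vec)
print([e.name for e in candidates])
print([e.name for e in ranked])
```

In this toy setup the image-only stage prefers Landmark A, while the fused score promotes Landmark B because its text embedding matches the text query. This is the kind of reordering a re-ranking step is meant to provide when an image and a text query are both available.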

Problem

Research questions and friction points this paper is trying to address.

Bridging remote sensing imagery with comprehensive knowledge using multi-modal data
Enhancing semantic reasoning for complex queries with domain-specific knowledge
Improving vision-language tasks like captioning and QA via retrieval-augmented generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal dataset integrating satellite imagery and text
Retrieval-Augmented Generation framework for remote sensing
Knowledge-augmented prompt guiding VLM responses
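
As a rough illustration of the knowledge-augmented prompt idea, the sketch below places retrieved text ahead of the user question. The function name `build_knowledge_prompt`, the `max_snippets` parameter, and the prompt wording are all hypothetical; the paper's actual template is not given in this summary.

```python
from typing import List


def build_knowledge_prompt(question: str, knowledge_snippets: List[str],
                           max_snippets: int = 3) -> str:
    """Assemble a text prompt that lists retrieved knowledge before the question."""
    selected = knowledge_snippets[:max_snippets]  # keep only the top-ranked snippets
    lines = ["You are given a remote sensing image and retrieved background knowledge."]
    if selected:
        lines.append("Retrieved knowledge:")
        for i, snippet in enumerate(selected, start=1):
            lines.append(f"[{i}] {snippet.strip()}")
    else:
        lines.append("No background knowledge was retrieved.")
    lines.append(f"Question: {question.strip()}")
    lines.append("Answer using the image and the retrieved knowledge where relevant.")
    return "\n".join(lines)


# Placeholder snippets stand in for retrieved landmark descriptions.
prompt = build_knowledge_prompt(
    "What is this site?",
    ["snippet one ", "snippet two", "snippet three"],
    max_snippets=2,
)
print(prompt)
```

Capping the number of snippets is one simple way to keep the prompt within the VLM's context budget; the resulting string would be passed to the model together with the query image.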
Congcong Wen
School of Cyber Science and Technology, University of Science and Technology of China, Anhui, 230026, China; Department of Electrical and Computer Engineering, New York University Abu Dhabi, Abu Dhabi, UAE
Yiting Lin
School of Cyber Science and Technology, University of Science and Technology of China, Anhui, 230026, China
Xiaokang Qu
School of Cyber Science and Technology, University of Science and Technology of China, Anhui, 230026, China
Nan Li
China Academy of Electronics and Information Technology, Beijing 100846, China
Yong Liao
University of Science and Technology of China
network security, data mining, Internet routing, network virtualization
Hui Lin
China Academy of Electronics and Information Technology, Beijing 100846, China
Xiang Li
Department of Computer Science at the University of Reading, Reading RG6 6AH, UK