Beyond Pixels: A Training-Free, Text-to-Text Framework for Remote Sensing Image Retrieval

📅 2025-12-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
The semantic gap between low-level visual features and high-level semantics hinders semantic retrieval in remote sensing (RS) imagery. Method: This paper proposes a zero-training, text-only cross-modal retrieval paradigm that reformulates image retrieval as a text-to-text (T2T) matching task. Key components include: (i) leveraging vision-language models (VLMs) to automatically generate structured image descriptions; (ii) constructing RSRT—the first RS benchmark with multi-granularity semantic annotations; and (iii) designing structured prompts and a unified text embedding space to enable fine-tuning-free matching. Results: On the RSITMD dataset, the method achieves a mean recall of 42.62%, nearly doubling zero-shot CLIP performance and surpassing multiple supervised state-of-the-art models—demonstrating for the first time that high-quality textual representations can match supervised learning performance in RS semantic retrieval.
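The headline number above is mean Recall (mR), conventionally the average of Recall@1, Recall@5, and Recall@10 over both retrieval directions (text-to-image and image-to-text) on benchmarks like RSITMD and RSICD. A minimal sketch of that metric, with toy ranks that are illustrative only and not taken from the paper:

```python
# Sketch of the mean Recall (mR) metric as commonly defined for
# RSITMD/RSICD-style retrieval benchmarks. Not the authors' code.

def recall_at_k(ranks, k):
    """Fraction of queries whose ground-truth item appears in the
    top-k results. `ranks` holds the 1-based rank of the correct
    item for each query."""
    return sum(r <= k for r in ranks) / len(ranks)

def mean_recall(t2i_ranks, i2t_ranks, ks=(1, 5, 10)):
    """Average R@1, R@5, R@10 over both retrieval directions."""
    scores = [recall_at_k(rs, k) for rs in (t2i_ranks, i2t_ranks) for k in ks]
    return sum(scores) / len(scores)

# Toy example: ranks of the correct match for 4 queries per direction.
t2i = [1, 3, 7, 20]
i2t = [2, 1, 12, 4]
mr = mean_recall(t2i, i2t)
```

A reported mR of 42.62% therefore means that, averaged over the six (direction, k) combinations, the correct match lands in the top-k results for roughly 43% of queries.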

📝 Abstract
Semantic retrieval of remote sensing (RS) images is a critical task fundamentally challenged by the "semantic gap", the discrepancy between a model's low-level visual features and high-level human concepts. While large Vision-Language Models (VLMs) offer a promising path to bridge this gap, existing methods often rely on costly, domain-specific training, and there is a lack of benchmarks to evaluate the practical utility of VLM-generated text in a zero-shot retrieval context. To address this research gap, we introduce the Remote Sensing Rich Text (RSRT) dataset, a new benchmark featuring multiple structured captions per image. Based on this dataset, we propose a fully training-free, text-only retrieval framework called TRSLLaVA. Our methodology reformulates cross-modal retrieval as a text-to-text (T2T) matching problem, leveraging rich text descriptions as queries against a database of VLM-generated captions within a unified textual embedding space. This approach completely bypasses model training or fine-tuning. Experiments on the RSITMD and RSICD benchmarks show our training-free method is highly competitive with state-of-the-art supervised models. For instance, on RSITMD, our method achieves a mean Recall of 42.62%, nearly doubling the 23.86% of the standard zero-shot CLIP baseline and surpassing several top supervised models. This validates that high-quality semantic representation through structured text provides a powerful and cost-effective paradigm for remote sensing image retrieval.
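The T2T reformulation described in the abstract can be sketched in a few lines: every database image is represented solely by its VLM-generated caption, and retrieval reduces to ranking captions by embedding similarity with the query text. The sketch below is illustrative only; a real system would use a trained sentence encoder (e.g. a CLIP text tower) as the unified text embedding space, whereas here a toy bag-of-words embedding stands in so the example is self-contained.

```python
# Illustrative sketch of text-to-text (T2T) retrieval over VLM-generated
# captions. Not the authors' implementation: `embed` is a toy stand-in
# for the paper's unified text encoder.
import math
from collections import Counter

def embed(text):
    """Toy text embedding: a bag-of-words frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def t2t_retrieve(query, caption_db):
    """Rank images by similarity between the query text and each image's
    VLM-generated caption -- no visual features are compared at all."""
    q = embed(query)
    scored = [(img_id, cosine(q, embed(cap))) for img_id, cap in caption_db.items()]
    return sorted(scored, key=lambda x: x[1], reverse=True)

# Hypothetical caption database produced offline by a VLM.
caption_db = {
    "img_001": "a large airport with two parallel runways and terminals",
    "img_002": "dense residential area with small houses and narrow roads",
    "img_003": "a harbor with many boats docked along the pier",
}
ranking = t2t_retrieve("an airport with runways", caption_db)
```

Because captioning happens once offline, query-time matching involves only text embeddings, which is what lets the approach skip any image-side training or fine-tuning.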
Problem

Research questions and friction points this paper is trying to address.

Bridging the semantic gap in remote sensing image retrieval
Eliminating costly domain-specific training for vision-language models
Providing a benchmark for zero-shot retrieval using VLM-generated text
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free text-to-text retrieval framework
Leverages VLM-generated captions in unified embedding space
Uses structured text descriptions as semantic queries
Jinghao Xiao
School of Computer Science, Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, Australia

Yiheng Guo
School of Computer Science, Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, Australia

Xing Zi
Researcher, University of Technology Sydney
Computer Vision · Remote Sensing · Multimodal

Karthick Thiyagarajan
Smart Sensing and Robotics Laboratory (SensR Lab), Centre for Advanced Manufacturing Technology, Western Sydney University, Sydney, Australia

Catarina Moreira
Associate Professor in Machine Learning, Data Science Institute, UTS
Explainable AI · Human-Centered AI · Deep Learning · Probabilistic Models · Quantum Cognition

Mukesh Prasad
School of Computer Science, Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, Australia