Beyond Pixels: A Training-Free, Text-to-Text Framework for Remote Sensing Image Retrieval

📅 2025-12-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
The semantic gap between low-level visual features and high-level semantics hinders semantic retrieval in remote sensing (RS) imagery. Method: This paper proposes a zero-training, text-only cross-modal retrieval paradigm that reformulates image retrieval as a text-to-text (T2T) matching task. Key components include: (i) leveraging vision-language models (VLMs) to automatically generate structured image descriptions; (ii) constructing RSRT—the first RS benchmark with multi-granularity semantic annotations; and (iii) designing structured prompts and a unified text embedding space to enable fine-tuning-free matching. Results: On the RSITMD dataset, the method achieves a mean recall of 42.62%, nearly doubling zero-shot CLIP performance and surpassing multiple supervised state-of-the-art models—demonstrating for the first time that high-quality textual representations can match supervised learning performance in RS semantic retrieval.
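The headline number above is mean Recall (mR), conventionally the average of Recall@1, Recall@5, and Recall@10 over both retrieval directions (text-to-image and image-to-text) on benchmarks like RSITMD and RSICD. A minimal sketch of that metric, with toy ranks that are illustrative only and not taken from the paper:

```python
# Sketch of the mean Recall (mR) metric as commonly defined for
# RSITMD/RSICD-style retrieval benchmarks. Not the authors' code.

def recall_at_k(ranks, k):
    """Fraction of queries whose ground-truth item appears in the
    top-k results. `ranks` holds the 1-based rank of the correct
    item for each query."""
    return sum(r <= k for r in ranks) / len(ranks)

def mean_recall(t2i_ranks, i2t_ranks, ks=(1, 5, 10)):
    """Average R@1, R@5, R@10 over both retrieval directions."""
    scores = [recall_at_k(rs, k) for rs in (t2i_ranks, i2t_ranks) for k in ks]
    return sum(scores) / len(scores)

# Toy example: ranks of the correct match for 4 queries per direction.
t2i = [1, 3, 7, 20]
i2t = [2, 1, 12, 4]
mr = mean_recall(t2i, i2t)
```

A reported mR of 42.62% therefore means that, averaged over the six (direction, k) combinations, the correct match lands in the top-k results for roughly 43% of queries.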

📝 Abstract
Semantic retrieval of remote sensing (RS) images is a critical task fundamentally challenged by the "semantic gap", the discrepancy between a model's low-level visual features and high-level human concepts. While large Vision-Language Models (VLMs) offer a promising path to bridge this gap, existing methods often rely on costly, domain-specific training, and there is a lack of benchmarks to evaluate the practical utility of VLM-generated text in a zero-shot retrieval context. To address this research gap, we introduce the Remote Sensing Rich Text (RSRT) dataset, a new benchmark featuring multiple structured captions per image. Based on this dataset, we propose a fully training-free, text-only retrieval framework called TRSLLaVA. Our methodology reformulates cross-modal retrieval as a text-to-text (T2T) matching problem, leveraging rich text descriptions as queries against a database of VLM-generated captions within a unified textual embedding space. This approach completely bypasses model training or fine-tuning. Experiments on the RSITMD and RSICD benchmarks show our training-free method is highly competitive with state-of-the-art supervised models. For instance, on RSITMD, our method achieves a mean Recall of 42.62%, nearly doubling the 23.86% of the standard zero-shot CLIP baseline and surpassing several top supervised models. This validates that high-quality semantic representation through structured text provides a powerful and cost-effective paradigm for remote sensing image retrieval.
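The T2T reformulation described in the abstract can be sketched in a few lines: every database image is represented solely by its VLM-generated caption, and retrieval reduces to ranking captions by embedding similarity with the query text. The sketch below is illustrative only; a real system would use a trained sentence encoder (e.g. a CLIP text tower) as the unified text embedding space, whereas here a toy bag-of-words embedding stands in so the example is self-contained.

```python
# Illustrative sketch of text-to-text (T2T) retrieval over VLM-generated
# captions. Not the authors' implementation: `embed` is a toy stand-in
# for the paper's unified text encoder.
import math
from collections import Counter

def embed(text):
    """Toy text embedding: a bag-of-words frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def t2t_retrieve(query, caption_db):
    """Rank images by similarity between the query text and each image's
    VLM-generated caption -- no visual features are compared at all."""
    q = embed(query)
    scored = [(img_id, cosine(q, embed(cap))) for img_id, cap in caption_db.items()]
    return sorted(scored, key=lambda x: x[1], reverse=True)

# Hypothetical caption database produced offline by a VLM.
caption_db = {
    "img_001": "a large airport with two parallel runways and terminals",
    "img_002": "dense residential area with small houses and narrow roads",
    "img_003": "a harbor with many boats docked along the pier",
}
ranking = t2t_retrieve("an airport with runways", caption_db)
```

Because captioning happens once offline, query-time matching involves only text embeddings, which is what lets the approach skip any image-side training or fine-tuning.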
Problem

Research questions and friction points this paper is trying to address.

Bridging the semantic gap in remote sensing image retrieval
Eliminating costly domain-specific training for vision-language models
Providing a benchmark for zero-shot retrieval using VLM-generated text
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free text-to-text retrieval framework
Leverages VLM-generated captions in unified embedding space
Uses structured text descriptions as semantic queries
Jinghao Xiao
School of Computer Science, Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, Australia

Yiheng Guo
School of Computer Science, Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, Australia

Xing Zi
Researcher, University of Technology Sydney
Computer Vision · Remote Sensing · Multimodal

Karthick Thiyagarajan
Smart Sensing and Robotics Laboratory (SensR Lab), Centre for Advanced Manufacturing Technology, Western Sydney University, Sydney, Australia

Catarina Moreira
Associate Professor in Machine Learning, Data Science Institute, UTS
Explainable AI · Human-Centered AI · Deep Learning · Probabilistic Models · Quantum Cognition

Mukesh Prasad
School of Computer Science, Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, Australia