Enhancing Remote Sensing Vision-Language Models for Zero-Shot Scene Classification

📅 2024-09-01

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

career value

178K/year

🤖 AI Summary

In zero-shot scene classification of remote sensing images, conventional patch-wise independent inference suffers from contextual fragmentation and poor generalization. Method: This paper proposes an inductive–deductive collaborative transductive inference framework—the first to introduce unsupervised transductive reasoning into remote sensing vision-language models. It constructs a patch-level semantic affinity graph to jointly model local–global alignment between text prompts and image encodings, enabling cross-patch knowledge propagation and contextual enhancement. Built upon a CLIP variant architecture, the method integrates prompt engineering with patch-level affinity modeling—requiring no additional annotations or fine-tuning. Results: Evaluated on ten mainstream remote sensing datasets, it achieves an average 5.2% improvement in zero-shot classification accuracy, with computational overhead increased by less than 3%, significantly outperforming state-of-the-art approaches.

Technology Category

Application Category

📝 Abstract

Vision-Language Models for remote sensing have shown promising uses thanks to their extensive pretraining. However, their conventional usage in zero-shot scene classification methods still involves dividing large images into patches and making independent predictions, i.e., inductive inference, thereby limiting their effectiveness by ignoring valuable contextual information. Our approach tackles this issue by utilizing initial predictions based on text prompting and patch affinity relationships from the image encoder to enhance zero-shot capabilities through transductive inference, all without the need for supervision and at a minor computational cost. Experiments on 10 remote sensing datasets with state-of-the-art Vision-Language Models demonstrate significant accuracy improvements over inductive zero-shot classification. Our source code is publicly available on Github: https://github.com/elkhouryk/RS-TransCLIP

Problem

Research questions and friction points this paper is trying to address.

Remote Sensing Image Recognition

Contextual Inference

Unseen Scene Accuracy

Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Language Model

Conductive Reasoning

Unseen Remote Sensing Images

🔎 Similar Papers

Mind the Modality Gap: Towards a Remote Sensing Vision-Language Model via Cross-modal Alignment