Geospatial-Reasoning-Driven Vocabulary-Agnostic Remote Sensing Semantic Segmentation

📅 2026-02-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge that existing open-vocabulary remote sensing semantic segmentation methods struggle to distinguish spectrally similar yet semantically distinct land cover types due to a lack of geospatial contextual awareness. To overcome this limitation, we propose the Geospatial Reasoning Chain-of-Thought (GR-CoT) framework, which introduces geospatial contextual reasoning into this task for the first time. GR-CoT dynamically constructs image-adaptive vocabularies through scene anchoring, feature disentanglement, and knowledge-driven decision-making, synergistically combining offline knowledge distillation with online instance-level reasoning to guide pixel-wise semantic alignment. By integrating multimodal large language models, vision–text alignment, and a geospatial reasoning chain, our method achieves significant performance gains over state-of-the-art approaches on the LoveDA and GID-5 benchmarks, notably improving segmentation accuracy for visually ambiguous land cover classes.
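As a rough illustration of how an image-adaptive vocabulary could guide pixel-wise alignment, the sketch below labels each pixel with the vocabulary entry whose text embedding is most similar to the pixel embedding. It is a toy written for this summary, not the paper's implementation: the function names `cosine` and `label_pixels`, the 2-D embeddings, and the use of cosine similarity are all assumptions, and a real system would use array operations on encoder outputs.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def label_pixels(
    pixel_feats: List[List[Sequence[float]]],
    text_feats: Dict[str, Sequence[float]],
) -> List[List[str]]:
    """Assign each pixel the class name with the most similar text embedding.

    pixel_feats: H x W grid of embedding vectors (hypothetical encoder output).
    text_feats:  class name -> text embedding, restricted to the vocabulary
                 selected for this image.
    """
    return [
        [max(text_feats, key=lambda name: cosine(px, text_feats[name])) for px in row]
        for row in pixel_feats
    ]


# Toy 1 x 2 image with made-up 2-D embeddings (illustrative values only).
pixels = [[[0.9, 0.1], [0.2, 0.8]]]
full_vocab = {"water": [1.0, 0.0], "farmland": [0.0, 1.0]}
labels = label_pixels(pixels, full_vocab)
```

Restricting `text_feats` to the classes selected for a given image removes implausible candidates before the per-pixel comparison, which is the role the summary attributes to the image-adaptive vocabulary.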

📝 Abstract

Open-vocabulary semantic segmentation has emerged as a promising research direction in remote sensing, enabling the recognition of diverse land-cover types beyond pre-defined category sets. However, existing methods predominantly rely on the passive mapping of visual features and textual embeddings. This "appearance-based" paradigm lacks geospatial contextual awareness, leading to severe semantic ambiguity and misclassification when encountering land-cover classes with similar spectral features but distinct semantic attributes. To address this, we propose a Geospatial Reasoning Chain-of-Thought (GR-CoT) framework designed to enhance the scene understanding capabilities of Multimodal Large Language Models (MLLMs), thereby guiding open-vocabulary segmentation models toward precise mapping. The framework comprises two collaborative components: an offline knowledge distillation stream and an online instance reasoning stream. The offline stream establishes fine-grained category interpretation standards to resolve semantic conflicts between similar land-cover types. During online inference, the framework executes a sequential reasoning process involving macro-scenario anchoring, visual feature decoupling, and knowledge-driven decision synthesis. This process generates an image-adaptive vocabulary that guides downstream models to achieve pixel-level alignment with correct geographical semantics. Extensive experiments on the LoveDA and GID-5 benchmarks demonstrate the superiority of our approach.
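The three-step online reasoning process described in the abstract can be sketched as a small pipeline. This is a minimal sketch under stated assumptions, not the authors' code: the `ask` callable stands in for any MLLM query interface, and the prompts, the `ReasoningTrace` type, the `build_vocabulary` function, and the comma-separated reply format are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical interface: any function mapping a prompt string to a text reply.
AskFn = Callable[[str], str]


@dataclass
class ReasoningTrace:
    scene: str             # result of macro-scenario anchoring
    cues: List[str]        # result of visual feature decoupling
    vocabulary: List[str]  # image-adaptive vocabulary after decision synthesis


def _split(reply: str) -> List[str]:
    """Parse a comma-separated reply into lowercase, non-empty items."""
    return [item.strip().lower() for item in reply.split(",") if item.strip()]


def build_vocabulary(
    ask: AskFn, knowledge: Dict[str, str], candidates: List[str]
) -> ReasoningTrace:
    # Step 1: macro-scenario anchoring (what kind of scene is this?).
    scene = ask("Describe the overall scene type in one word.").strip().lower()

    # Step 2: visual feature decoupling (which separate cues are visible?).
    cues = _split(ask("List visible visual cues, comma-separated."))

    # Step 3: knowledge-driven decision synthesis, combining the scene, the
    # cues, and the offline category interpretation standards.
    standards = "\n".join(f"- {name}: {rule}" for name, rule in knowledge.items())
    prompt = (
        f"Scene: {scene}\nCues: {', '.join(cues)}\nStandards:\n{standards}\n"
        f"From {candidates}, list the classes present, comma-separated."
    )
    chosen = set(_split(ask(prompt)))

    # Keep only known candidates, so unexpected names in the reply are dropped.
    vocabulary = [name for name in candidates if name in chosen]
    return ReasoningTrace(scene, cues, vocabulary)


def fake_ask(prompt: str) -> str:
    """Canned replies standing in for a real MLLM (illustration only)."""
    if prompt.startswith("Describe"):
        return "Rural"
    if prompt.startswith("List visible"):
        return "regular green parcels, narrow roads"
    return "agriculture, road, spaceship"


trace = build_vocabulary(
    fake_ask,
    {"agriculture": "regular parcels with crop texture"},
    ["building", "road", "agriculture"],
)
```

Filtering the reply against the candidate list is a design choice of this sketch: it keeps the resulting vocabulary within classes the downstream segmentation model is expected to handle, even if the model's reply names something else.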

Problem

Research questions and friction points this paper is trying to address.

open-vocabulary semantic segmentation

geospatial reasoning

remote sensing

semantic ambiguity

land-cover classification

Innovation

Methods, ideas, or system contributions that make the work stand out.

Geospatial Reasoning

Open-vocabulary Semantic Segmentation

Multimodal Large Language Models