AI Summary
This work addresses the challenge that existing open-vocabulary remote sensing semantic segmentation methods struggle to distinguish spectrally similar yet semantically distinct land cover types due to a lack of geospatial contextual awareness. To overcome this limitation, we propose the Geospatial Reasoning Chain-of-Thought (GR-CoT) framework, which introduces geospatial contextual reasoning into this task for the first time. GR-CoT dynamically constructs image-adaptive vocabularies through scene anchoring, feature disentanglement, and knowledge-driven decision-making, synergistically combining offline knowledge distillation with online instance-level reasoning to guide pixel-wise semantic alignment. By integrating multimodal large language models, vision-text alignment, and a geospatial reasoning chain, our method achieves significant performance gains over state-of-the-art approaches on the LoveDA and GID-5 benchmarks, notably improving segmentation accuracy for visually ambiguous land cover classes.
Abstract
Open-vocabulary semantic segmentation has emerged as a promising research direction in remote sensing, enabling the recognition of diverse land-cover types beyond pre-defined category sets. However, existing methods predominantly rely on the passive mapping of visual features and textual embeddings. This "appearance-based" paradigm lacks geospatial contextual awareness, leading to severe semantic ambiguity and misclassification when encountering land-cover classes with similar spectral features but distinct semantic attributes. To address this, we propose a Geospatial Reasoning Chain-of-Thought (GR-CoT) framework designed to enhance the scene understanding capabilities of Multimodal Large Language Models (MLLMs), thereby guiding open-vocabulary segmentation models toward precise mapping. The framework comprises two collaborative components: an offline knowledge distillation stream and an online instance reasoning stream. The offline stream establishes fine-grained category interpretation standards to resolve semantic conflicts between similar land-cover types. During online inference, the framework executes a sequential reasoning process involving macro-scenario anchoring, visual feature decoupling, and knowledge-driven decision synthesis. This process generates an image-adaptive vocabulary that guides downstream models to achieve pixel-level alignment with correct geographical semantics. Extensive experiments on the LoveDA and GID-5 benchmarks demonstrate the superiority of our approach.
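The three online reasoning steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function `build_vocabulary`, the `ask` callable standing in for an MLLM, the prompt wording, and the entries of `KNOWLEDGE_BASE` are all hypothetical and chosen only to show how scene anchoring, feature decoupling, and knowledge-driven synthesis might chain into an image-adaptive vocabulary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for the output of the offline knowledge distillation
# stream: category name -> interpretation rule. Entries are invented placeholders.
KNOWLEDGE_BASE: Dict[str, str] = {
    "agricultural land": "regular parcels with crop texture, usually in rural scenes",
    "forest": "dense irregular canopy with no parcel boundaries",
    "water": "smooth dark surface bounded by banks or shorelines",
    "building": "compact roofs with sharp edges and cast shadows",
}


@dataclass
class ReasoningTrace:
    scene: str             # result of macro-scenario anchoring
    cues: List[str]        # result of visual feature decoupling
    vocabulary: List[str]  # image-adaptive vocabulary


def build_vocabulary(image_id: str, ask: Callable[[str, str], str]) -> ReasoningTrace:
    """Run an assumed three-step reasoning chain; `ask(image_id, prompt)` is a
    hypothetical wrapper around an MLLM call that returns plain text."""
    # Step 1: macro-scenario anchoring (coarse scene type).
    scene = ask(image_id, "Is this scene urban or rural? Answer with one word.")
    scene = scene.strip().lower()

    # Step 2: visual feature decoupling (list low-level cues separately).
    raw_cues = ask(image_id, "List visible visual cues, separated by commas.")
    cues = [c.strip().lower() for c in raw_cues.split(",") if c.strip()]

    # Step 3: knowledge-driven decision synthesis (combine scene, cues, rules).
    rules = "\n".join(f"- {name}: {rule}" for name, rule in KNOWLEDGE_BASE.items())
    prompt = (
        f"Scene: {scene}\nCues: {', '.join(cues)}\nRules:\n{rules}\n"
        "Which categories are present? Separate with commas."
    )
    answer = ask(image_id, prompt)

    # Keep only categories defined in the knowledge base, in order, no duplicates.
    vocabulary: List[str] = []
    for name in (a.strip().lower() for a in answer.split(",")):
        if name in KNOWLEDGE_BASE and name not in vocabulary:
            vocabulary.append(name)
    return ReasoningTrace(scene, cues, vocabulary)


def fake_mllm(image_id: str, prompt: str) -> str:
    """Canned responses so the sketch runs without any model."""
    if prompt.startswith("Is this scene"):
        return "Rural"
    if prompt.startswith("List visible"):
        return "regular parcels, dark smooth surface"
    return "agricultural land, water, swimming pool"


trace = build_vocabulary("tile_001", fake_mllm)
print(trace.vocabulary)
```

In this sketch the final filter drops "swimming pool" because it has no entry in the placeholder knowledge base. The resulting list is what would be passed as text prompts to a downstream open-vocabulary segmentation model.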