🤖 AI Summary
To address the dual challenges of fine-grained localization and open-set semantic classification in remote sensing imagery, this paper proposes the first unsupervised three-stage framework for land-cover analysis. First, a fine-tuned Segment Anything Model (SAM) enables label-free pixel-level mask extraction. Second, a two-phase fine-tuning strategy adapts a multimodal large language model (MLLM) to automatically generate semantic names and contextual descriptions for novel land-cover classes. Third, an LLM-as-judge mechanism evaluates the plausibility of the generated descriptions. This work pioneers the deep integration of MLLMs into the land-cover understanding pipeline, achieving high-precision segmentation and interpretable semantic outputs without any human annotation. Experiments on diverse satellite imagery demonstrate strong generalization and human-readable outputs, substantially improving the practicality and scalability of automated cartographic updating and large-scale Earth observation analytics.
📝 Abstract
Open-set land-cover analysis in remote sensing demands both fine-grained spatial localization and semantically open categorization: detecting and segmenting novel objects without categorical supervision, and assigning them interpretable semantic labels through multimodal reasoning. In this study, we introduce OSDA, an integrated three-stage framework for annotation-free open-set land-cover discovery, segmentation, and description. The pipeline consists of: (1) precise discovery and mask extraction with a promptable fine-tuned segmentation model (SAM); (2) semantic attribution and contextual description via a two-phase fine-tuned multimodal large language model (MLLM); and (3) evaluation of the MLLM outputs through a combination of LLM-as-judge scoring and manual review. By combining pixel-level accuracy with high-level semantic understanding, OSDA addresses key challenges in open-world remote sensing interpretation. Designed to be architecture-agnostic and label-free, the framework supports robust evaluation across diverse satellite imagery without manual annotation. Our work provides a scalable and interpretable solution for dynamic land-cover monitoring, showing strong potential for automated cartographic updating and large-scale Earth observation analysis.
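The three-stage pipeline described in the abstract can be sketched as below. This is a minimal structural sketch, not the paper's implementation: `segment_regions`, `describe_region`, and `judge_description` are hypothetical placeholders standing in for the fine-tuned SAM, the two-phase fine-tuned MLLM, and the LLM judge, whose actual interfaces the abstract does not specify.

```python
# Hypothetical sketch of the OSDA three-stage flow. All three model calls
# are placeholders, not the authors' APIs.

def segment_regions(image):
    """Stage 1: promptable segmentation -> pixel-level masks (placeholder)."""
    # A real implementation would run a fine-tuned SAM; here we fake one mask.
    return [{"mask_id": 0, "pixels": [(10, 12), (10, 13)]}]

def describe_region(image, mask):
    """Stage 2: MLLM assigns a semantic name and description (placeholder)."""
    return {"name": "novel-class", "description": "open-set land-cover region"}

def judge_description(desc):
    """Stage 3: LLM-as-judge scores plausibility in [0, 1] (placeholder)."""
    return 1.0 if desc["name"] and desc["description"] else 0.0

def osda_pipeline(image, threshold=0.5):
    """Discovery -> description -> judging; keep only plausible results."""
    results = []
    for mask in segment_regions(image):
        desc = describe_region(image, mask)
        score = judge_description(desc)
        if score >= threshold:
            results.append({"mask": mask, **desc, "score": score})
    return results

print(osda_pipeline(image=None))
```

The point of the structure is that each stage is swappable (architecture-agnostic, per the abstract): any promptable segmenter, captioning MLLM, or judge model could fill the three slots.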