🤖 AI Summary
Remote sensing image segmentation guided by natural language faces challenges including multi-object coordination, multi-granularity parsing, implicit intent understanding, and linguistic diversity, which limit existing methods' performance in complex geospatial scenes. To address these, we introduce LaSeRS, the first large-scale, task-specific benchmark for language-guided remote sensing segmentation, together with SegEarth-R2, a model that adds spatial attention supervision to enhance precise localization of small objects and a dynamic segmentation query mechanism enabling adaptive, multi-object collaborative segmentation. Furthermore, we integrate a multimodal large language model with a hierarchical language-pixel alignment training strategy. Extensive experiments demonstrate that SegEarth-R2 significantly outperforms state-of-the-art methods on LaSeRS and multiple mainstream benchmarks, establishing a strong new baseline for geospatial semantic segmentation. All code and data will be publicly released.
📝 Abstract
Effectively grounding complex language to pixels in remote sensing (RS) images is a critical challenge for applications like disaster response and environmental monitoring. Current models can parse simple, single-target commands but fail when presented with complex geospatial scenarios, e.g., segmenting objects at various granularities, executing multi-target instructions, and interpreting implicit user intent. To drive progress against these failures, we present LaSeRS, the first large-scale dataset built for comprehensive training and evaluation across four critical dimensions of language-guided segmentation: hierarchical granularity, target multiplicity, reasoning requirements, and linguistic variability. By capturing these dimensions, LaSeRS moves beyond simple commands, providing a benchmark for complex geospatial reasoning. This addresses a critical gap: existing datasets oversimplify the task, producing models that are brittle in real-world scenarios. We also propose SegEarth-R2, an MLLM architecture designed for comprehensive language-guided segmentation in RS, which directly confronts these challenges. The model's effectiveness stems from two key improvements: (1) a spatial attention supervision mechanism that specifically handles the localization of small objects and their components, and (2) a flexible and efficient segmentation query mechanism that handles both single-target and multi-target scenarios. Experimental results demonstrate that SegEarth-R2 achieves outstanding performance on LaSeRS and other benchmarks, establishing a powerful baseline for the next generation of geospatial segmentation. All data and code will be released at https://github.com/earth-insights/SegEarth-R2.
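To give a concrete intuition for the first improvement, spatial attention supervision is commonly realized as an auxiliary loss that pushes a model's cross-attention map toward the ground-truth segmentation mask, so that attention concentrates on small targets. The abstract does not specify SegEarth-R2's exact formulation, so the snippet below is a minimal NumPy sketch under that assumption; the function name `attention_supervision_loss` and the per-pixel binary cross-entropy form are illustrative, not the paper's definition.

```python
import numpy as np

def attention_supervision_loss(attn: np.ndarray, gt_mask: np.ndarray) -> float:
    """Hypothetical sketch of spatial attention supervision.

    attn:    (H, W) attention weights in (0, 1) for one query
    gt_mask: (H, W) binary ground-truth mask, downsampled to the
             attention map's resolution

    Returns the mean per-pixel binary cross-entropy, which is low when
    attention concentrates on the annotated object and high otherwise.
    """
    eps = 1e-8
    attn = np.clip(attn, eps, 1.0 - eps)  # avoid log(0)
    bce = -(gt_mask * np.log(attn) + (1.0 - gt_mask) * np.log(1.0 - attn))
    return float(bce.mean())
```

In training, a term like this would be added to the usual mask loss so that even a tiny object (a few pixels of `gt_mask`) contributes a direct gradient to the attention weights rather than only to the final mask head.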