RSRefSeg 2: Decoupling Referring Remote Sensing Image Segmentation with Foundation Models

📅 2025-07-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing referring remote sensing image segmentation methods tightly couple target localization with boundary delineation, which amplifies error propagation under semantic ambiguity and limits generalizability and interpretability. To address this, RSRefSeg 2 reformulates the workflow as a decoupled two-stage framework: (1) a cross-modal coarse localization stage that leverages CLIP, together with cascaded second-order text prompting, to precisely activate multi-entity semantics; and (2) a fine-grained pixel-level segmentation stage in which the resulting semantic prompts guide an enhanced Segment Anything Model (SAM). This design significantly improves model interpretability and out-of-distribution generalization. Evaluated on RefSegRS, RRSIS-D, and RISBench, RSRefSeg 2 achieves an average gain of roughly 3% gIoU, with particularly strong performance on complex semantic understanding tasks.

📝 Abstract
Referring Remote Sensing Image Segmentation provides a flexible and fine-grained framework for remote sensing scene analysis via vision-language collaborative interpretation. Current approaches predominantly follow a three-stage pipeline of dual-modal encoding, cross-modal interaction, and pixel decoding. They show significant limitations in managing complex semantic relationships and achieving precise cross-modal alignment, largely because their coupled processing mechanism conflates target localization with boundary delineation. This architectural coupling amplifies error propagation under semantic ambiguity while restricting model generalizability and interpretability. To address these issues, we propose RSRefSeg 2, a decoupling paradigm that reformulates the conventional workflow into a collaborative dual-stage framework: coarse localization followed by fine segmentation. RSRefSeg 2 integrates CLIP's cross-modal alignment strength with SAM's segmentation generalizability through strategic foundation model collaboration. Specifically, CLIP serves as the dual-modal encoder, activating target features within its pre-aligned semantic space and generating localization prompts. To mitigate CLIP's misactivation in multi-entity scenarios described by referring texts, a cascaded second-order prompter is devised, which enhances precision through implicit reasoning by decomposing text embeddings into complementary semantic subspaces. These optimized semantic prompts then direct SAM to generate pixel-level refined masks, completing the semantic transmission pipeline. Extensive experiments on RefSegRS, RRSIS-D, and RISBench demonstrate that RSRefSeg 2 surpasses contemporary methods in segmentation accuracy (roughly +3% gIoU) and complex semantic interpretation. Code is available at: https://github.com/KyanChen/RSRefSeg2.
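To make the prompter idea concrete, here is a minimal PyTorch sketch of how a cascaded second-order prompter might work: it splits the referring-text embedding into two complementary subspaces, uses the first to coarsely activate CLIP patch features, and uses the second to refine the pooled activation into prompt tokens. All module names, dimensions, and the specific fusion scheme are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class CascadedSecondOrderPrompter(nn.Module):
    """Illustrative sketch (not the paper's code): split a referring-text
    embedding into two complementary semantic subspaces, use the first to
    coarsely activate CLIP patch features, and use the second to refine the
    pooled activation into prompt tokens for the segmentation stage."""

    def __init__(self, dim: int = 512, num_prompts: int = 4):
        super().__init__()
        self.proj_a = nn.Linear(dim, dim)   # first-order ("where") subspace
        self.proj_b = nn.Linear(dim, dim)   # second-order ("what") subspace
        self.to_prompts = nn.Linear(dim, num_prompts * dim)
        self.num_prompts, self.dim = num_prompts, dim

    def forward(self, text_emb: torch.Tensor, vis_feats: torch.Tensor) -> torch.Tensor:
        # text_emb:  (B, D)    pooled CLIP text embedding
        # vis_feats: (B, N, D) patch-level CLIP visual features
        sub_a = self.proj_a(text_emb)                          # (B, D)
        sub_b = self.proj_b(text_emb)                          # (B, D)
        # First order: soft attention of patches to the first subspace.
        attn = torch.einsum("bnd,bd->bn", vis_feats, sub_a).softmax(dim=-1)
        coarse = torch.einsum("bn,bnd->bd", attn, vis_feats)   # (B, D)
        # Second order: refine with the complementary subspace, then emit
        # a small set of prompt tokens for the mask decoder.
        refined = coarse + sub_b                               # (B, D)
        return self.to_prompts(refined).view(-1, self.num_prompts, self.dim)
```

As a smoke test, `CascadedSecondOrderPrompter()(torch.randn(2, 512), torch.randn(2, 196, 512))` returns a `(2, 4, 512)` tensor, i.e. one set of prompt tokens per image-text pair.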
Problem

Research questions and friction points this paper is trying to address.

Decoupling target localization and boundary delineation in remote sensing
Improving cross-modal alignment for vision-language interpretation
Enhancing segmentation accuracy in complex semantic scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples coarse localization from fine-grained segmentation
Integrates the CLIP and SAM foundation models
Uses a cascaded second-order prompter for precise multi-entity activation (a minimal wiring sketch follows below)
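The sketch below shows, under stated assumptions, how the two stages could be wired together. It assumes the CLIP encoder exposes `encode_image` returning patch-level features and `encode_text` returning a pooled embedding (stock CLIP pools image features, so patch-level access would need a hook), and that the SAM decoder accepts an image plus prompt embeddings; none of these interfaces are taken from the paper's code.

```python
import torch.nn as nn

class RSRefSeg2Wiring(nn.Module):
    """Hypothetical two-stage wiring: CLIP localizes, SAM segments.
    All interface names here are assumptions for illustration."""

    def __init__(self, clip_encoder, prompter, sam_decoder):
        super().__init__()
        self.clip = clip_encoder    # frozen dual-modal encoder (stage 1)
        self.prompter = prompter    # e.g., the CascadedSecondOrderPrompter above
        self.sam = sam_decoder      # prompt-conditioned mask decoder (stage 2)

    def forward(self, image, text_tokens):
        # Stage 1: coarse localization in CLIP's pre-aligned semantic space.
        vis_feats = self.clip.encode_image(image)      # (B, N, D) patch features
        text_emb = self.clip.encode_text(text_tokens)  # (B, D) pooled embedding
        prompts = self.prompter(text_emb, vis_feats)   # (B, P, D) prompt tokens
        # Stage 2: semantic prompts steer SAM toward a refined pixel-level mask.
        return self.sam(image, prompts)                # (B, 1, H, W) mask logits
```

Because the two stages communicate only through prompt embeddings, each foundation model can be frozen or fine-tuned independently, which is the interpretability and generalization benefit the paper attributes to decoupling.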
Authors
Keyan Chen, Beihang University
Chenyang Liu, Beihang University
Bowen Chen, Beihang University
Jiafan Zhang, Beihang University
Zhengxia Zou, Beihang University
Zhenwei Shi, Beihang University