RSRefSeg 2: Decoupling Referring Remote Sensing Image Segmentation with Foundation Models

📅 2025-07-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing referring remote sensing image segmentation methods tightly couple target localization with boundary delineation, which amplifies error propagation under semantic ambiguity and limits generalizability and interpretability. To address this, RSRefSeg 2 reformulates the workflow as a decoupled two-stage framework: (1) a cross-modal coarse localization stage that leverages CLIP, together with cascaded second-order text prompting, to precisely activate multi-entity semantics; and (2) a fine-grained pixel-level segmentation stage in which the resulting semantic prompts guide an enhanced Segment Anything Model (SAM). This design significantly improves model interpretability and out-of-distribution generalization. Evaluated on RefSegRS, RRSIS-D, and RISBench, RSRefSeg 2 achieves an average gain of roughly 3% gIoU, with particularly strong performance on complex semantic understanding tasks.

📝 Abstract
Referring Remote Sensing Image Segmentation provides a flexible and fine-grained framework for remote sensing scene analysis via vision-language collaborative interpretation. Current approaches predominantly follow a three-stage pipeline of dual-modal encoding, cross-modal interaction, and pixel decoding. They show significant limitations in managing complex semantic relationships and achieving precise cross-modal alignment, largely because their coupled processing mechanism conflates target localization with boundary delineation. This architectural coupling amplifies error propagation under semantic ambiguity while restricting model generalizability and interpretability. To address these issues, we propose RSRefSeg 2, a decoupling paradigm that reformulates the conventional workflow into a collaborative dual-stage framework: coarse localization followed by fine segmentation. RSRefSeg 2 integrates CLIP's cross-modal alignment strength with SAM's segmentation generalizability through strategic foundation model collaboration. Specifically, CLIP serves as the dual-modal encoder, activating target features within its pre-aligned semantic space and generating localization prompts. To mitigate CLIP's misactivation in multi-entity scenarios described by referring texts, a cascaded second-order prompter is devised, which enhances precision through implicit reasoning by decomposing text embeddings into complementary semantic subspaces. These optimized semantic prompts then direct SAM to generate pixel-level refined masks, completing the semantic transmission pipeline. Extensive experiments on RefSegRS, RRSIS-D, and RISBench demonstrate that RSRefSeg 2 surpasses contemporary methods in segmentation accuracy (roughly +3% gIoU) and complex semantic interpretation. Code is available at: https://github.com/KyanChen/RSRefSeg2.
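To make the prompter idea concrete, here is a minimal PyTorch sketch of how a cascaded second-order prompter might work: it splits the referring-text embedding into two complementary subspaces, uses the first to coarsely activate CLIP patch features, and uses the second to refine the pooled activation into prompt tokens. All module names, dimensions, and the specific fusion scheme are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class CascadedSecondOrderPrompter(nn.Module):
    """Illustrative sketch (not the paper's code): split a referring-text
    embedding into two complementary semantic subspaces, use the first to
    coarsely activate CLIP patch features, and use the second to refine the
    pooled activation into prompt tokens for the segmentation stage."""

    def __init__(self, dim: int = 512, num_prompts: int = 4):
        super().__init__()
        self.proj_a = nn.Linear(dim, dim)   # first-order ("where") subspace
        self.proj_b = nn.Linear(dim, dim)   # second-order ("what") subspace
        self.to_prompts = nn.Linear(dim, num_prompts * dim)
        self.num_prompts, self.dim = num_prompts, dim

    def forward(self, text_emb: torch.Tensor, vis_feats: torch.Tensor) -> torch.Tensor:
        # text_emb:  (B, D)    pooled CLIP text embedding
        # vis_feats: (B, N, D) patch-level CLIP visual features
        sub_a = self.proj_a(text_emb)                          # (B, D)
        sub_b = self.proj_b(text_emb)                          # (B, D)
        # First order: soft attention of patches to the first subspace.
        attn = torch.einsum("bnd,bd->bn", vis_feats, sub_a).softmax(dim=-1)
        coarse = torch.einsum("bn,bnd->bd", attn, vis_feats)   # (B, D)
        # Second order: refine with the complementary subspace, then emit
        # a small set of prompt tokens for the mask decoder.
        refined = coarse + sub_b                               # (B, D)
        return self.to_prompts(refined).view(-1, self.num_prompts, self.dim)
```

As a smoke test, `CascadedSecondOrderPrompter()(torch.randn(2, 512), torch.randn(2, 196, 512))` returns a `(2, 4, 512)` tensor, i.e. one set of prompt tokens per image-text pair.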
Problem

Research questions and friction points this paper is trying to address.

Decoupling target localization and boundary delineation in remote sensing
Improving cross-modal alignment for vision-language interpretation
Enhancing segmentation accuracy in complex semantic scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples coarse localization from fine-grained segmentation
Integrates the CLIP and SAM foundation models
Uses a cascaded second-order prompter for precise multi-entity activation (a minimal wiring sketch follows below)
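The sketch below shows, under stated assumptions, how the two stages could be wired together. It assumes the CLIP encoder exposes `encode_image` returning patch-level features and `encode_text` returning a pooled embedding (stock CLIP pools image features, so patch-level access would need a hook), and that the SAM decoder accepts an image plus prompt embeddings; none of these interfaces are taken from the paper's code.

```python
import torch.nn as nn

class RSRefSeg2Wiring(nn.Module):
    """Hypothetical two-stage wiring: CLIP localizes, SAM segments.
    All interface names here are assumptions for illustration."""

    def __init__(self, clip_encoder, prompter, sam_decoder):
        super().__init__()
        self.clip = clip_encoder    # frozen dual-modal encoder (stage 1)
        self.prompter = prompter    # e.g., the CascadedSecondOrderPrompter above
        self.sam = sam_decoder      # prompt-conditioned mask decoder (stage 2)

    def forward(self, image, text_tokens):
        # Stage 1: coarse localization in CLIP's pre-aligned semantic space.
        vis_feats = self.clip.encode_image(image)      # (B, N, D) patch features
        text_emb = self.clip.encode_text(text_tokens)  # (B, D) pooled embedding
        prompts = self.prompter(text_emb, vis_feats)   # (B, P, D) prompt tokens
        # Stage 2: semantic prompts steer SAM toward a refined pixel-level mask.
        return self.sam(image, prompts)                # (B, 1, H, W) mask logits
```

Because the two stages communicate only through prompt embeddings, each foundation model can be frozen or fine-tuned independently, which is the interpretability and generalization benefit the paper attributes to decoupling.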
Authors
Keyan Chen, Beihang University
Chenyang Liu, Beihang University
Bowen Chen, Beihang University
Jiafan Zhang, Beihang University
Zhengxia Zou, Beihang University
Zhenwei Shi, Beihang University