🤖 AI Summary
This paper addresses key challenges in referring remote sensing image segmentation, including weak visual-linguistic modality alignment, poor localization of small objects, ambiguous object boundaries, and multi-scale interference, by proposing a novel cross-modal segmentation framework. Methodologically, it integrates multi-scale feature interaction, bidirectional spatial correlation computation, and joint vision-language prediction. Key contributions include: (1) bidirectional spatial correlation modeling to enhance fine-grained vision-language alignment; (2) a target-background twin-stream decoder to improve discriminability between foreground and background; and (3) a dual-modal object learning strategy to strengthen semantic consistency. Evaluated on RefSegRS and RRSIS-D, the method achieves state-of-the-art performance with overall IoU (oIoU) scores of 80.57% and 79.23%, surpassing the prior best methods by 3.76 and 1.44 percentage points, respectively; mean IoU (mIoU) improves by 5.37 and 1.84 percentage points, reaching 67.95% and 66.04%.
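To make the first contribution concrete, here is a minimal PyTorch sketch of what bidirectional spatial correlation could look like, assuming it is realized as paired cross-attention (vision attending to language and language attending to vision) with a learned fusion back into the visual stream. The class name, layer choices, and fusion scheme are all hypothetical; the summary does not specify the module's internals.

```python
import torch
import torch.nn as nn

class BidirectionalSpatialCorrelation(nn.Module):
    """Hypothetical sketch: cross-attention in both directions between
    visual tokens and word embeddings, fused into the visual stream."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.v2l = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.l2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        # vis:  (B, N, C) flattened visual tokens (e.g., one pyramid level)
        # lang: (B, T, C) word-level language embeddings
        v_att, _ = self.v2l(query=vis, key=lang, value=lang)  # vision -> language
        l_att, _ = self.l2v(query=lang, key=vis, value=vis)   # language -> vision
        # Pool the vision-aware language tokens, broadcast over space,
        # and fuse both directions back into the visual features.
        l_ctx = l_att.mean(dim=1, keepdim=True).expand_as(v_att)
        return self.norm(vis + self.fuse(torch.cat([v_att, l_ctx], dim=-1)))

# Usage: a flattened 32x32 feature map and a 20-token referring expression.
bsc = BidirectionalSpatialCorrelation(dim=256)
out = bsc(torch.randn(2, 32 * 32, 256), torch.randn(2, 20, 256))  # (2, 1024, 256)
```

Applying such a module at several pyramid levels would also cover the multi-scale feature interaction the summary mentions.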
📝 Abstract
Referring Remote Sensing Image Segmentation (RRSIS) is critical for ecological monitoring, urban planning, and disaster management, requiring precise segmentation of objects in remote sensing imagery guided by textual descriptions. The task is uniquely challenging due to the considerable vision-language gap, the high spatial resolution and broad coverage of remote sensing imagery with diverse categories and small targets, and the presence of clustered, indistinct targets with blurred edges. To tackle these issues, we propose a novel framework designed to bridge the vision-language gap, enhance multi-scale feature interaction, and improve fine-grained object differentiation. Specifically, the framework introduces: (1) a Bidirectional Spatial Correlation (BSC) module for improved vision-language feature alignment, (2) a Target-Background TwinStream Decoder (T-BTD) for precise distinction between targets and non-targets, and (3) a Dual-Modal Object Learning Strategy (D-MOLS) for robust multimodal feature reconstruction. Extensive experiments on the benchmark datasets RefSegRS and RRSIS-D demonstrate that our method achieves state-of-the-art performance. Specifically, it improves the overall IoU (oIoU) by 3.76 percentage points (to 80.57%) and 1.44 percentage points (to 79.23%) on the two datasets, respectively. It also outperforms previous methods in mean IoU (mIoU) by 5.37 percentage points (to 67.95%) and 1.84 percentage points (to 66.04%), effectively addressing the core challenges of RRSIS with enhanced precision and robustness.
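The abstract does not detail the T-BTD or D-MOLS designs, but their stated roles suggest straightforward sketches. First, a twin-stream decoder can be read as two parallel heads, one scoring target evidence and one scoring background evidence, with the final mask obtained by contrasting the two; every layer choice below is an assumption, not the paper's architecture:

```python
import torch
import torch.nn as nn

class TwinStreamDecoder(nn.Module):
    """Hypothetical sketch: parallel target and background streams whose
    contrast yields the final mask, sharpening object boundaries."""

    def __init__(self, dim: int):
        super().__init__()
        def head() -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(dim, 1, kernel_size=1),
            )
        self.target_head = head()
        self.background_head = head()

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) language-conditioned visual features
        t_logits = self.target_head(feats)      # evidence for the referred object
        b_logits = self.background_head(feats)  # evidence for everything else
        # Softmax across the two streams gives the target probability map.
        probs = torch.softmax(torch.cat([t_logits, b_logits], dim=1), dim=1)
        return probs[:, :1]                     # (B, 1, H, W)
```

Second, since D-MOLS is described as performing multimodal feature reconstruction, a plausible training objective pairs the segmentation loss with reconstruction terms in both modalities; the choice of losses and the `lambda_rec` weight are illustrative, not the paper's values:

```python
import torch.nn.functional as F

def dual_modal_object_loss(pred_logits, gt_mask,
                           vis_recon, vis_target,
                           txt_recon, txt_target,
                           lambda_rec: float = 0.1):
    """Hypothetical D-MOLS-style objective: segmentation plus
    dual-modality feature reconstruction for semantic consistency."""
    seg = F.binary_cross_entropy_with_logits(pred_logits, gt_mask)
    rec_v = F.mse_loss(vis_recon, vis_target)  # rebuild (masked) visual tokens
    rec_t = F.mse_loss(txt_recon, txt_target)  # rebuild (masked) word features
    return seg + lambda_rec * (rec_v + rec_t)
```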