Image-Conditioned Instance Prompt Network for Referring Remote Sensing Image Segmentation

📅 2026-05-23
🤖 AI Summary
This work addresses Referring Remote Sensing Image Segmentation (RRSIS), where insufficient granularity in textual descriptions and sensitivity to semantic shifts hinder cross-modal feature fusion. The authors propose the Image-Conditioned Instance Prompt Network (ICIPNet), whose Image-Conditioned Instance Prompt (ICIP) module generates self-adaptive visual and semantic representations without relying on external knowledge. A Bilateral Information Fusion (BIF) module then enhances feature fusion along both the token and channel dimensions. According to the reported experiments, ICIPNet outperforms existing RRSIS models.
📝 Abstract
Referring Remote Sensing Image Segmentation (RRSIS) is a situated, task-driven cross-modal problem related to the embodied perception paradigm: models must align visual-spatial features with linguistic intentions to perceive the target precisely. Recent research has focused on refining the granularity of textual features and optimizing image-text feature fusion to better guide target feature representations. However, insufficient descriptive granularity and sensitivity to semantic shifts can cause bottlenecks in cross-modal feature fusion. To address these issues, we propose the Image-Conditioned Instance Prompt Network (ICIPNet) with Bilateral Information Fusion. ICIPNet introduces an Image-Conditioned Instance Prompt (ICIP) module to generate self-adaptive visual and semantic representations without external knowledge. The Bilateral Information Fusion (BIF) module enhances feature fusion along the token and channel dimensions. Experiments demonstrate that the proposed ICIPNet outperforms existing RRSIS models.
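The abstract does not describe how BIF is built, so the following is only a minimal sketch of what fusing along the token and channel dimensions could look like, not the authors' implementation. It assumes text-to-visual attention for the token axis and a sigmoid gate derived from the mean text feature for the channel axis. The function names (`token_fusion`, `channel_fusion`, `fuse`) and the toy feature values are hypothetical, and no learned weights are involved.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def token_fusion(visual, text):
    """Token axis (assumed design): each visual token attends over all text tokens."""
    dim = len(text[0])
    fused = []
    for v in visual:
        # Scaled dot-product score between this visual token and every text token.
        scores = [sum(a * b for a, b in zip(v, t)) / math.sqrt(dim) for t in text]
        weights = softmax(scores)
        # Attention-weighted sum of text tokens, channel by channel.
        context = [sum(w * t[c] for w, t in zip(weights, text)) for c in range(dim)]
        # Residual connection keeps the original visual feature.
        fused.append([a + b for a, b in zip(v, context)])
    return fused


def channel_fusion(visual, text):
    """Channel axis (assumed design): a text-derived sigmoid gate rescales each channel."""
    dim = len(text[0])
    mean_text = [sum(t[c] for t in text) / len(text) for c in range(dim)]
    gate = [1.0 / (1.0 + math.exp(-m)) for m in mean_text]
    return [[a * g for a, g in zip(v, gate)] for v in visual]


def fuse(visual, text):
    """Apply token-axis fusion first, then channel-axis gating."""
    return channel_fusion(token_fusion(visual, text), text)


# Toy inputs: 2 visual tokens and 3 text tokens, each with 3 channels.
visual = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
text = [[0.2, 0.1, 0.0], [0.0, 0.3, 0.4], [0.5, 0.5, 0.5]]
fused = fuse(visual, text)
```

The output keeps the shape of the visual input (one fused vector per visual token), which is the property a segmentation decoder downstream would typically rely on. A bilateral design would presumably also update the text features from the visual ones; that direction is omitted here.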
Problem

Research questions and friction points this paper is trying to address.

Referring Remote Sensing Image Segmentation
Cross-modal Feature Fusion
Semantic Shift
Descriptive Granularity
Visual-linguistic Alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Image-Conditioned Instance Prompt
Bilateral Information Fusion
Referring Remote Sensing Image Segmentation
Cross-modal Feature Fusion
Self-adaptive Representation