🤖 AI Summary
To address coarse-grained multimodal alignment and insufficient feature fusion in referring segmentation of remote sensing images, this paper proposes a fine-grained cross-modal collaborative segmentation framework, the multimodal-aware fusion network (MAFN). The method introduces two key components: (1) a Correlation Fusion Module (CFM) that enhances multi-scale visual features by injecting adaptive noise into the Transformer encoder and integrates cross-modal aware features for fine-grained alignment between text and vision; and (2) a Multi-Scale Refinement Convolution (MSRC) that adapts to the varying orientations of objects at different scales, strengthening their representations and improving segmentation accuracy. Evaluated on the RRSIS-D benchmark, the proposed approach achieves significant improvements over existing state-of-the-art methods, attaining a 3.2% gain in mean Intersection-over-Union (mIoU). The source code is publicly available.
📝 Abstract
Referring remote sensing image segmentation (RRSIS) is a novel task in remote sensing image segmentation that aims to segment objects according to a given text description, and it has great significance in practical applications. Previous studies fuse the visual and linguistic modalities through explicit feature interaction, which fails to effectively exploit the useful multimodal information in the dual-branch encoder. In this letter, we design a multimodal-aware fusion network (MAFN) to achieve fine-grained alignment and fusion between the two modalities. We propose a correlation fusion module (CFM) that enhances multi-scale visual features by introducing adaptive noise into the Transformer and integrates cross-modal aware features. In addition, MAFN employs a multi-scale refinement convolution (MSRC) that adapts to the various orientations of objects at different scales, boosting their representation ability and enhancing segmentation accuracy. Extensive experiments show that MAFN is significantly more effective than state-of-the-art methods on the RRSIS-D dataset. The source code is available at https://github.com/Roaxy/MAFN.
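The abstract only names the CFM at a high level. As a rough illustration of what a cross-modal correlation fusion step could look like, here is a minimal PyTorch sketch in which visual tokens attend to word-level language features and the correlation re-weights the visual map, with small noise added to the queries during training. The class name, tensor shapes, and fixed noise scale are assumptions for illustration, not the authors' released implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

class CorrelationFusion(nn.Module):
    """Hypothetical correlation-fusion sketch: visual tokens query
    word-level text features; the cross-modal correlation (attention)
    map re-weights the language values, which are fused back into the
    visual stream with a residual connection."""

    def __init__(self, vis_dim: int, txt_dim: int, noise_std: float = 0.1):
        super().__init__()
        self.q_proj = nn.Linear(vis_dim, vis_dim)   # queries from visual tokens
        self.k_proj = nn.Linear(txt_dim, vis_dim)   # keys from language tokens
        self.v_proj = nn.Linear(txt_dim, vis_dim)   # values from language tokens
        self.noise_std = noise_std                  # assumed fixed noise scale

    def forward(self, vis: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # vis: (B, HW, C_v) flattened visual tokens; txt: (B, L, C_t) word features
        q = self.q_proj(vis)
        if self.training:                           # inject noise only in training
            q = q + torch.randn_like(q) * self.noise_std
        k, v = self.k_proj(txt), self.v_proj(txt)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return vis + attn @ v                       # residual cross-modal fusion
```

In this reading, the noise acts as a light regularizer on the attention queries, which is one plausible interpretation of "introducing adaptive noise into the Transformer."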
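Similarly, the MSRC is described only as adapting to object orientations at multiple scales. A common way to realize that idea is with parallel square and strip (1×k / k×1) convolutions whose outputs are fused by a 1×1 convolution; the sketch below follows that pattern. The kernel sizes and branch layout are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MultiScaleRefineConv(nn.Module):
    """Hypothetical multi-scale refinement block: a square branch plus
    horizontal and vertical strip branches capture structures at
    different scales and orientations; a 1x1 conv fuses the branches
    back into the input resolution with a residual connection."""

    def __init__(self, channels: int, k: int = 5):
        super().__init__()
        pad = k // 2
        self.square = nn.Conv2d(channels, channels, 3, padding=1)
        self.horiz = nn.Conv2d(channels, channels, (1, k), padding=(0, pad))
        self.vert = nn.Conv2d(channels, channels, (k, 1), padding=(pad, 0))
        self.fuse = nn.Conv2d(3 * channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) fused cross-modal feature map
        branches = torch.cat([self.square(x), self.horiz(x), self.vert(x)], dim=1)
        return x + self.fuse(branches)              # residual refinement
```

Strip kernels are a standard choice for elongated, arbitrarily oriented remote sensing objects (roads, runways, ships), which is why they are used as the stand-in here.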