Multimodal-Aware Fusion Network for Referring Remote Sensing Image Segmentation

📅 2025-03-14
🏛️ IEEE Geoscience and Remote Sensing Letters
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address coarse-grained multimodal alignment and insufficient feature fusion in referring segmentation of remote sensing images, this paper proposes a fine-grained cross-modal collaborative segmentation framework. The method introduces two key components: (1) a Correlation Fusion Module (CFM) that enables pixel-wise semantic alignment between textual and visual features via cross-modal correlation modeling; and (2) a Multi-Scale Refinement Convolution (MSRC) integrated with an adaptive noise-augmented Transformer-based visual encoder, which jointly captures multi-directional, multi-scale object structures and orientation invariance. Evaluated on the RRSIS-D benchmark, the proposed approach achieves significant improvements over existing state-of-the-art methods, attaining a 3.2% gain in mean Intersection-over-Union (mIoU). The source code is publicly available.
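The CFM's internals are not spelled out on this page. As a rough, non-authoritative illustration of pixel-wise text-visual correlation fusion of the kind described, a PyTorch sketch could look as follows; the class name, projection layers, and tensor shapes are all assumptions, not the released code.

```python
import torch
import torch.nn as nn

class CorrelationFusionSketch(nn.Module):
    """Toy sketch of pixel-word correlation fusion (hypothetical stand-in for CFM)."""

    def __init__(self, vis_dim: int, txt_dim: int, dim: int = 256):
        super().__init__()
        self.vis_proj = nn.Conv2d(vis_dim, dim, kernel_size=1)  # project visual feature map
        self.txt_proj = nn.Linear(txt_dim, dim)                  # project word-level text features
        self.out_proj = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, vis: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # vis: (B, Cv, H, W) visual features; txt: (B, L, Ct) word-level text features
        B, _, H, W = vis.shape
        v = self.vis_proj(vis).flatten(2).transpose(1, 2)        # (B, HW, dim)
        t = self.txt_proj(txt)                                   # (B, L, dim)
        # pixel-word correlation map, softmax-normalised over words
        corr = torch.softmax(v @ t.transpose(1, 2) / v.shape[-1] ** 0.5, dim=-1)  # (B, HW, L)
        fused = corr @ t                                         # aggregate text evidence per pixel
        fused = (v + fused).transpose(1, 2).reshape(B, -1, H, W) # residual fusion, back to a map
        return self.out_proj(fused)
```

The softmax over words is one common way to obtain a per-pixel attention distribution; the paper may normalise or gate the correlation differently.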

📝 Abstract
Referring remote sensing image segmentation (RRSIS) is a novel task in remote sensing image segmentation that aims to segment objects based on a given text description and has great significance in practical applications. Previous studies fuse the visual and linguistic modalities through explicit feature interaction, which fails to effectively exploit the useful multimodal information produced by the dual-branch encoder. In this letter, we design a multimodal-aware fusion network (MAFN) to achieve fine-grained alignment and fusion between the two modalities. We propose a correlation fusion module (CFM) that enhances multi-scale visual features by introducing adaptive noise into the transformer and integrates cross-modal aware features. In addition, MAFN employs multi-scale refinement convolution (MSRC) to adapt to the various orientations of objects at different scales, boosting their representational ability and thus segmentation accuracy. Extensive experiments show that MAFN is significantly more effective than the state of the art on the RRSIS-D dataset. The source code is available at https://github.com/Roaxy/MAFN.
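The abstract describes MSRC only at a high level. Assuming it combines parallel dilated and asymmetric (horizontal/vertical) kernels to cover objects of different scales and orientations, a minimal sketch might look like the block below; the branch layout and names are guesses, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiScaleRefineSketch(nn.Module):
    """Illustrative multi-scale, multi-directional refinement block (MSRC-like guess)."""

    def __init__(self, channels: int):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1, dilation=1),
            nn.Conv2d(channels, channels, 3, padding=2, dilation=2),  # larger receptive field
            nn.Conv2d(channels, channels, (1, 5), padding=(0, 2)),    # horizontal context
            nn.Conv2d(channels, channels, (5, 1), padding=(2, 0)),    # vertical context
        ])
        self.fuse = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # sum branch responses, refine with a 1x1 conv, keep a residual connection
        y = sum(branch(x) for branch in self.branches)
        return x + self.fuse(y)
```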
Problem

Research questions and friction points this paper is trying to address.

How to segment objects in remote sensing images from a given text description.
How to achieve fine-grained multimodal feature fusion for better segmentation accuracy.
How to enhance multi-scale visual features with adaptive noise in transformers (one possible form is sketched below).
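The exact form of the adaptive noise is not specified here. One plausible reading, Gaussian noise scaled by a learned, token-dependent gate and applied only during training, is sketched below; the gating scheme is an assumption for illustration, not the paper's formulation.

```python
import torch
import torch.nn as nn

class AdaptiveNoiseSketch(nn.Module):
    """Hypothetical adaptive noise injection for transformer visual tokens."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())  # per-token noise scale

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim); perturb only while training
        if not self.training:
            return tokens
        noise = torch.randn_like(tokens) * self.gate(tokens)
        return tokens + noise
```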
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal-aware fusion network (MAFN) for fine-grained alignment (a combined structural sketch follows this list)
Correlation fusion module enhances multi-scale visual features
Multi-scale refinement convolution boosts object representation
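Putting the listed components together, a structural guess at the overall forward pass, reusing the placeholder modules sketched above (none of this is the released MAFN code), could be:

```python
import torch.nn as nn

class MAFNSketch(nn.Module):
    """Structural guess at chaining the sketched components; not the authors' architecture."""

    def __init__(self, vis_dim: int = 512, txt_dim: int = 768, dim: int = 256):
        super().__init__()
        # classes below are the illustrative sketches defined earlier on this page
        self.noise = AdaptiveNoiseSketch(vis_dim)
        self.cfm = CorrelationFusionSketch(vis_dim, txt_dim, dim)
        self.msrc = MultiScaleRefineSketch(dim)
        self.head = nn.Conv2d(dim, 1, kernel_size=1)  # coarse mask logits

    def forward(self, vis_tokens, txt, hw):
        # vis_tokens: (B, N, vis_dim) from a transformer encoder; hw = (H, W) with N == H * W
        B, N, C = vis_tokens.shape
        H, W = hw
        vis = self.noise(vis_tokens).transpose(1, 2).reshape(B, C, H, W)
        fused = self.cfm(vis, txt)    # pixel-word alignment and fusion
        refined = self.msrc(fused)    # multi-scale, multi-direction refinement
        return self.head(refined)
```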
Leideng Shi
School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201600, China
Juan Zhang
Department of Mathematics, Xiangtan University
Matrix Computation · Numerical Algebra · Numerical Algorithm