AMMNet: An Asymmetric Multi-Modal Network for Remote Sensing Semantic Segmentation

📅 2025-07-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address computational redundancy and cross-modal feature misalignment in RGB–DSM multimodal fusion for remote sensing semantic segmentation, this paper proposes an efficient asymmetric network architecture. Methodologically, it employs a deep RGB encoder coupled with a lightweight DSM encoder to form an asymmetric dual-encoder structure; introduces a modality-aware prior fusion module to enable guided cross-modal feature integration; and establishes a distribution alignment mechanism based on KL divergence minimization to explicitly enforce feature-space consistency. Evaluated on the ISPRS Vaihingen and Potsdam benchmarks, the method achieves state-of-the-art segmentation accuracy among multimodal approaches (mIoU of 87.3% and 85.6%, respectively), while reducing computational cost by 32% and GPU memory consumption by 41%. These improvements significantly enhance both efficiency and robustness in complex urban scenes.
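The distribution-alignment idea described above can be made concrete with a small sketch. The snippet below is an illustrative toy, not the paper's implementation: the activation values and function names are invented. It treats pooled channel activations from the RGB and DSM branches as logits, converts each to a probability distribution with a softmax, and computes the KL divergence that such an alignment loss would drive toward zero.

```python
import math

def softmax(logits):
    """Convert raw activations into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q, eps=1e-8):
    """KL(P || Q): grows as the two distributions drift apart."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical pooled per-channel activations from the two encoder branches.
rgb_feats = [2.0, 1.0, 0.5, 0.1]
dsm_feats = [1.8, 1.1, 0.4, 0.2]

p = softmax(rgb_feats)
q = softmax(dsm_feats)
alignment_loss = kl_divergence(p, q)  # added to the segmentation objective
```

Minimizing this term pushes the DSM branch's feature statistics toward the RGB branch's, which is the stated goal of enforcing feature-space consistency.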

📝 Abstract
Semantic segmentation in remote sensing (RS) has advanced significantly with the incorporation of multi-modal data, particularly the integration of RGB imagery and the Digital Surface Model (DSM), which provides complementary contextual and structural information about ground objects. However, integrating RGB and DSM often faces two major limitations: increased computational complexity due to architectural redundancy, and degraded segmentation performance caused by modality misalignment. These issues undermine the efficiency and robustness of semantic segmentation, particularly in complex urban environments where precise multi-modal integration is essential. To overcome these limitations, we propose the Asymmetric Multi-Modal Network (AMMNet), a novel asymmetric architecture that achieves robust and efficient semantic segmentation through three designs tailored for RGB-DSM input pairs. To reduce architectural redundancy, the Asymmetric Dual Encoder (ADE) module assigns representational capacity based on modality-specific characteristics, employing a deeper encoder for RGB imagery to capture rich contextual information and a lightweight encoder for DSM to extract sparse structural features. Furthermore, to facilitate modality alignment, the Asymmetric Prior Fuser (APF) integrates a modality-aware prior matrix into the fusion process, enabling the generation of structure-aware contextual features. Additionally, the Distribution Alignment (DA) module enhances cross-modal compatibility by aligning feature distributions through divergence minimization. Extensive experiments on the ISPRS Vaihingen and Potsdam datasets demonstrate that AMMNet attains state-of-the-art segmentation accuracy among multi-modal networks while reducing computational and memory requirements.
Problem

Research questions and friction points this paper is trying to address.

Reduces computational complexity in RGB-DSM integration
Improves segmentation accuracy by aligning modalities
Enhances feature compatibility across different data types
Innovation

Methods, ideas, or system contributions that make the work stand out.

Asymmetric Dual Encoder reduces architectural redundancy
Asymmetric Prior Fuser enables modality-aware fusion
Distribution Alignment minimizes feature divergence
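The capacity asymmetry behind the Asymmetric Dual Encoder can be illustrated with a back-of-the-envelope parameter count. The channel widths below are invented for illustration (the paper's actual encoder configurations may differ); the point is simply that budgeting a deep, wide stack for RGB context and a shallow, narrow stack for sparse DSM structure leaves the DSM branch a small fraction of the RGB branch's size.

```python
def conv_stack_params(channels, kernel=3):
    """Approximate weight count for a stack of kernel x kernel conv layers
    (biases and normalization layers ignored for simplicity)."""
    return sum(c_in * c_out * kernel * kernel
               for c_in, c_out in zip(channels, channels[1:]))

# Hypothetical channel widths: a deep RGB encoder for rich context,
# a lightweight DSM encoder for sparse height cues.
rgb_channels = [3, 64, 128, 256, 512]
dsm_channels = [1, 16, 32, 64, 64]

rgb_params = conv_stack_params(rgb_channels)
dsm_params = conv_stack_params(dsm_channels)
ratio = dsm_params / rgb_params  # DSM branch is a small fraction of RGB
```

Allocating capacity this way, rather than duplicating a symmetric encoder per modality, is what removes the architectural redundancy the Problem section describes.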
Hui Ye
School of Computer Science, The University of Sydney, Australia
Haodong Chen
School of Computer Science, The University of Sydney, Australia
Zeke Zexi Hu
University of Sydney
Computer Vision · Deep Learning · Machine Learning
Xiaoming Chen
School of Computer and Artificial Intelligence, Beijing Technology and Business University, China
Yuk Ying Chung
School of Computer Science, The University of Sydney, Australia