Parameter-Efficient Modality-Balanced Symmetric Fusion for Multimodal Remote Sensing Semantic Segmentation

📅 2026-03-18
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the high computational cost and modality imbalance that arise in multimodal remote sensing semantic segmentation, where the contribution of auxiliary modalities is often suppressed during optimization. To this end, the authors propose MoBaNet, a symmetric dual-stream framework built upon largely frozen vision foundation models. The architecture incorporates a Cross-modal Prompt-Injected Adapter (CPIA) to deepen inter-modal interaction, a Difference-Guided Gated Fusion Module (DGFM) for efficient and balanced feature fusion, and a Modality-Conditional Random Masking (MCRM) strategy combined with hard-pixel auxiliary supervision to mitigate modality imbalance. Evaluated on the ISPRS Vaihingen and Potsdam datasets, MoBaNet achieves state-of-the-art performance while requiring significantly fewer trainable parameters than full fine-tuning.
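The MCRM idea of randomly dropping one modality during training so that neither stream dominates can be sketched as follows. This is an illustrative NumPy sketch under assumed interfaces: the function name, the masking probability, and the zero-fill masking are all assumptions, not the paper's implementation.

```python
import numpy as np

def modality_conditional_mask(x_primary, x_aux, p_mask=0.5, rng=None):
    """Illustrative MCRM-style masking: during training, with probability
    p_mask, zero out exactly one of the two modality inputs, forcing the
    model to rely on the surviving modality.

    x_primary, x_aux: feature/input arrays for the two modalities.
    Names and probabilities are hypothetical, not from the paper.
    """
    rng = rng or np.random.default_rng()
    if rng.random() < p_mask:
        # Pick which modality to drop; at most one is ever masked per call.
        if rng.random() < 0.5:
            return np.zeros_like(x_primary), x_aux
        return x_primary, np.zeros_like(x_aux)
    return x_primary, x_aux
```

A training loop would apply this per batch, with the paper additionally imposing hard-pixel auxiliary supervision on the modality-specific branches.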

๐Ÿ“ Abstract
Multimodal remote sensing semantic segmentation enhances scene interpretation by exploiting complementary physical cues from heterogeneous data. Although pretrained Vision Foundation Models (VFMs) provide strong general-purpose representations, adapting them to multimodal tasks often incurs substantial computational overhead and is prone to modality imbalance, where the contribution of auxiliary modalities is suppressed during optimization. To address these challenges, we propose MoBaNet, a parameter-efficient and modality-balanced symmetric fusion framework. Built upon a largely frozen VFM backbone, MoBaNet adopts a symmetric dual-stream architecture to preserve generalizable representations while minimizing the number of trainable parameters. Specifically, we design a Cross-modal Prompt-Injected Adapter (CPIA) to enable deep semantic interaction by generating shared prompts and injecting them into bottleneck adapters under the frozen backbone. To obtain compact and discriminative multimodal representations for decoding, we further introduce a Difference-Guided Gated Fusion Module (DGFM), which adaptively fuses paired stage features by explicitly leveraging cross-modal discrepancy to guide feature selection. Furthermore, we propose a Modality-Conditional Random Masking (MCRM) strategy to mitigate modality imbalance by masking one modality only during training and imposing hard-pixel auxiliary supervision on modality-specific branches. Extensive experiments on the ISPRS Vaihingen and Potsdam benchmarks demonstrate that MoBaNet achieves state-of-the-art performance with significantly fewer trainable parameters than full fine-tuning, validating its effectiveness for robust and balanced multimodal fusion. The source code for this work is available at https://github.com/sauryeo/MoBaNet.
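The DGFM described above gates paired stage features by their cross-modal discrepancy. A minimal NumPy sketch of that idea, assuming a per-channel linear map in place of the module's actual gating network (all names and the gate parameterization are illustrative, not the paper's implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def discrepancy_gated_fusion(feat_a, feat_b, w, b):
    """Illustrative discrepancy-guided gated fusion of paired stage features.

    feat_a, feat_b: (C, H, W) features from the two modality streams.
    w, b: (C,) parameters of a hypothetical per-channel linear map that
          turns the absolute discrepancy into a gate in (0, 1).
    """
    diff = np.abs(feat_a - feat_b)                  # cross-modal discrepancy
    gate = sigmoid(w[:, None, None] * diff + b[:, None, None])
    return gate * feat_a + (1.0 - gate) * feat_b    # convex per-pixel blend
```

Because the gate is a convex combination, regions where the two modalities agree blend smoothly, while large discrepancies let the learned parameters steer selection toward one stream.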
Problem

Research questions and friction points this paper is trying to address.

multimodal remote sensing
semantic segmentation
modality imbalance
parameter efficiency
Vision Foundation Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

parameter-efficient
modality-balanced
symmetric fusion
cross-modal prompt
frozen foundation model
Haocheng Li
The Chinese University of Hong Kong
VLSI CAD
Juepeng Zheng
School of Artificial Intelligence, Sun Yat-Sen University, Zhuhai, China
Shuangxi Miao
College of Land Science and Technology, China Agricultural University; Key Laboratory of Remote Sensing for Agri-Hazards, Ministry of Agriculture and Rural Affairs, Beijing 100083, China
Ruibo Lu
Henan Polytechnic University; Key Laboratory of Spatio-Temporal Information and Ecological Restoration of Mines, Ministry of Natural Resources of the People's Republic of China, Jiaozuo 454003, China
Guosheng Cai
Henan Polytechnic University; Key Laboratory of Spatio-Temporal Information and Ecological Restoration of Mines, Ministry of Natural Resources of the People's Republic of China, Jiaozuo 454003, China
Haohuan Fu
Tsinghua University
Jianxi Huang
Professor at China Agricultural University
Data assimilation, climate change, agricultural remote sensing, crop modeling with remote sensing data assimilation, crop yield