🤖 AI Summary
This work addresses the performance degradation that remote sensing multimodal semantic segmentation suffers when a modality such as optical, SAR, or DSM is missing. Existing methods for this setting remain limited by feature collapse and overly generalized recovered features. To tackle this, the authors propose the STARS framework, which pairs shared-specific translation with an asymmetric alignment mechanism. Specifically, bidirectional modality translation with stop-gradient is introduced to prevent feature collapse, while a Pixel-level Semantic sampling Alignment (PSA) strategy, which combines class-balanced pixel sampling with a cross-modality semantic alignment loss, is designed to mitigate alignment failure caused by class imbalance. Experiments demonstrate that the proposed method substantially improves segmentation accuracy under missing-modality conditions, particularly for minority classes, while also reducing sensitivity to hyperparameters and showing stronger robustness and generalization.
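To make the role of gradient stopping concrete, one common way to write an asymmetric, bidirectional alignment objective is sketched below. This is an illustration only, not the loss defined in the paper: the symbols $f_a$, $f_b$ (features of two modalities), $T_{a \to b}$, $T_{b \to a}$ (translation modules), and the squared-error distance are all assumptions introduced here, and the authors may use a different distance or weighting.

```latex
% Hypothetical notation: sg(.) is the stop-gradient operator, which acts as
% the identity in the forward pass and has zero gradient in the backward pass.
\begin{equation}
\mathcal{L}_{\text{align}}
  = \bigl\lVert T_{a \to b}(f_a) - \operatorname{sg}(f_b) \bigr\rVert_2^2
  + \bigl\lVert T_{b \to a}(f_b) - \operatorname{sg}(f_a) \bigr\rVert_2^2
\end{equation}
```

Because each target is wrapped in $\operatorname{sg}(\cdot)$, each term pulls only the translated features toward the target and never moves the target toward the translation. Under this sketch, that asymmetry is what discourages the trivial solution in which both branches shrink to the same constant feature.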
📝 Abstract
Multimodal remote sensing technology significantly enhances the understanding of surface semantics by integrating heterogeneous data such as optical images, Synthetic Aperture Radar (SAR), and Digital Surface Models (DSM). In practical applications, however, missing modality data (e.g., optical or DSM) is a common and severe challenge that degrades the performance of traditional multimodal fusion models. Existing methods for addressing missing modalities still face limitations, including feature collapse and overly generalized recovered features. To address these issues, we propose STARS (Shared-specific Translation and Alignment for missing-modality Remote Sensing), a robust semantic segmentation framework for incomplete multimodal inputs. STARS is built on two key designs. First, we introduce an asymmetric alignment mechanism with bidirectional translation and stop-gradient, which effectively prevents feature collapse and reduces sensitivity to hyperparameters. Second, we propose a Pixel-level Semantic sampling Alignment (PSA) strategy that combines class-balanced pixel sampling with a cross-modality semantic alignment loss to mitigate alignment failures caused by severe class imbalance and to improve minority-class recognition.
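The class-balanced pixel sampling mentioned above can be illustrated with a minimal sketch. It assumes a simple scheme in which the same number of pixel coordinates is drawn from every class present in a label map. The function name `class_balanced_pixel_sample`, the `per_class` budget, and the `ignore_label` value of 255 are hypothetical choices made for this example and are not taken from the paper.

```python
import random
from collections import defaultdict


def class_balanced_pixel_sample(label_map, per_class, ignore_label=255, seed=None):
    """Draw up to `per_class` (row, col) coordinates for each class in `label_map`.

    Assumption: every class gets the same sampling budget, so frequent classes
    cannot dominate the set of pixels used for alignment.
    """
    rng = random.Random(seed)

    # Group pixel coordinates by their class label, skipping ignored pixels.
    coords_by_class = defaultdict(list)
    for r, row in enumerate(label_map):
        for c, label in enumerate(row):
            if label != ignore_label:
                coords_by_class[label].append((r, c))

    # Sample without replacement; rare classes contribute all of their pixels.
    sampled = {}
    for label, coords in coords_by_class.items():
        k = min(per_class, len(coords))
        sampled[label] = rng.sample(coords, k)
    return sampled


# Toy label map: class 0 covers nine pixels, classes 1 and 2 one pixel each,
# and one pixel carries the ignore label.
labels = [
    [0, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 2, 255],
]
picked = class_balanced_pixel_sample(labels, per_class=2, seed=0)
```

In this toy case the dominant class is capped at two pixels while each single-pixel class keeps its only pixel, so minority classes carry far more weight in the sampled set than in the raw image. An alignment loss computed over such a sample would then see every class at a comparable rate, which is the intuition behind pairing sampling with alignment.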