CSMoE: An Efficient Remote Sensing Foundation Model with Soft Mixture-of-Experts

📅 2025-09-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the prevalent issues of high computational overhead and weak representation capability in remote sensing foundation models (RS FMs), this paper proposes CSMoE, an efficient representation learning framework that integrates a soft mixture-of-experts (Soft MoE) mechanism. Methodologically: (i) it introduces a hybrid architecture combining modality-specific expert specialization with cross-sensor shared representation learning; (ii) it designs a thematic- and climatic-aware diverse data sampling strategy to broaden semantic coverage of the training set; and (iii) it employs a masked autoencoder for cross-sensor joint pretraining. Extensive experiments show that CSMoE matches or exceeds state-of-the-art accuracy on scene classification, semantic segmentation, and image retrieval, while achieving on average more than twice the computational efficiency of existing RS FMs, yielding a markedly better accuracy-efficiency trade-off for remote sensing foundation modeling.

📝 Abstract
Self-supervised learning through masked autoencoders has attracted great attention for remote sensing (RS) foundation model (FM) development, enabling improved representation learning across diverse sensors and downstream tasks. However, existing RS FMs often either suffer from substantial computational complexity during both training and inference or exhibit limited representational capacity. These issues restrict their practical applicability in RS. To address these limitations, we propose an adaptation for enhancing the efficiency of RS FMs by integrating the Soft mixture-of-experts (MoE) mechanism into the FM. The integration of Soft MoEs into the FM allows modality-specific expert specialization alongside shared cross-sensor representation learning. To demonstrate the effectiveness of our adaptation, we apply it to the Cross-Sensor Masked Autoencoder (CSMAE) model, resulting in the Cross-Sensor Mixture-of-Experts (CSMoE) model. In addition, we introduce a thematic-climatic descriptor-driven sampling strategy for the construction of a representative and diverse training set to train our CSMoE model. Extensive experiments on scene classification, semantic segmentation, and content-based image retrieval demonstrate that our adaptation reduces computational requirements while maintaining or improving representational performance. Compared to state-of-the-art RS FMs, CSMoE achieves a superior trade-off between representational capacity, accuracy, and computational efficiency. On average, CSMoE achieves more than twice the computational efficiency of existing RS FMs, while maintaining competitive performance across all experiments. These results show the effectiveness of the proposed adaptation for creating computationally efficient RS FMs. The code for the model, the training set creation, and the model weights will be available at https://git.tu-berlin.de/rsim/csmoe.
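The abstract builds on masked-autoencoder (MAE) pretraining, in which most patch tokens are hidden and the model learns by reconstructing them. As a minimal illustration of the masking step only (this is generic MAE-style random masking, not the authors' CSMAE/CSMoE code; function names are hypothetical):

```python
import numpy as np

def random_masking(tokens, mask_ratio=0.75, rng=None):
    """MAE-style random masking over patch tokens.

    tokens: (n_patches, dim) array of patch embeddings.
    Returns the visible tokens, their indices, and a boolean mask
    where True marks a patch hidden from the encoder.
    """
    rng = rng or np.random.default_rng()
    n = tokens.shape[0]
    n_keep = max(1, int(round(n * (1.0 - mask_ratio))))
    perm = rng.permutation(n)
    keep_idx = np.sort(perm[:n_keep])     # indices of visible patches
    mask = np.ones(n, dtype=bool)         # True = masked (to reconstruct)
    mask[keep_idx] = False                # False = visible to the encoder
    return tokens[keep_idx], keep_idx, mask
```

The encoder then processes only the visible tokens, which is where most of the pretraining compute savings of MAEs come from; a lightweight decoder reconstructs the masked patches.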
Problem

Research questions and friction points this paper is trying to address.

Addressing high computational complexity in remote sensing foundation models
Enhancing representational capacity while maintaining efficiency
Improving cross-sensor representation learning through specialized experts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates Soft Mixture-of-Experts mechanism
Uses thematic-climatic descriptor sampling strategy
Combines cross-sensor learning with expert specialization
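The core efficiency idea above is Soft MoE routing: instead of hard top-k token-to-expert assignment, every slot is a learned convex combination of all tokens, each expert processes its slots, and each output token is a convex combination of all slot outputs. A minimal sketch of this mechanism (following the generic Soft MoE formulation, not the authors' released CSMoE implementation; `phi` and the expert callables are illustrative placeholders):

```python
import numpy as np

def _softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_moe_layer(X, phi, experts):
    """Soft mixture-of-experts over a token sequence.

    X:       (n_tokens, dim) input tokens.
    phi:     (dim, n_slots) learnable slot parameters;
             n_slots must be divisible by len(experts).
    experts: list of callables mapping (k, dim) -> (k, dim).
    """
    logits = X @ phi                       # (n_tokens, n_slots)
    dispatch = _softmax(logits, axis=0)    # each slot: convex combo of tokens
    combine = _softmax(logits, axis=1)     # each token: convex combo of slots
    slots = dispatch.T @ X                 # (n_slots, dim) expert inputs
    per_expert = slots.shape[0] // len(experts)
    outs = np.concatenate([
        f(slots[i * per_expert:(i + 1) * per_expert])
        for i, f in enumerate(experts)     # each expert sees only its slots
    ])
    return combine @ outs                  # (n_tokens, dim)
```

Because each expert runs on a fixed, small number of slots rather than on every token, compute stays roughly constant as experts are added, which is what enables the modality-specific specialization described above without the usual MoE inference overhead.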