RingMoE: Mixture-of-Modality-Experts Multi-Modal Foundation Models for Universal Remote Sensing Image Interpretation

šŸ“… 2025-04-04
šŸ“ˆ Citations: 0
✨ Influential: 0
šŸ¤– AI Summary
Remote sensing image interpretation has long been constrained by single-modality modeling, which cannot exploit the complementary information in optical, SAR, and multi-spectral data to reduce ambiguity and uncertainty. To address this, the paper introduces RingMoE, a unified multi-modal remote sensing foundation model with 14.7 billion parameters, pre-trained on 400 million multi-modal images from nine satellites. Its hierarchical Mixture-of-Experts (MoE) architecture combines modal-specialized, collaborative, and shared experts, and pre-training uses physics-informed self-supervised objectives that embed sensor-specific radiometric characteristics. The model supports six fundamental tasks, including classification, detection, and segmentation, and sets new state-of-the-art results on 23 benchmarks. Dynamic expert pruning further compresses it from 14.7B to 1B parameters while retaining competitive performance. Deployed and trialed in sectors including emergency response, marine monitoring, and urban planning, RingMoE demonstrates strong practical utility and scalability.

šŸ“ Abstract
The rapid advancement of foundation models has revolutionized visual representation learning in a self-supervised manner. However, their application in remote sensing (RS) remains constrained by a fundamental gap: existing models predominantly handle single or limited modalities, overlooking the inherently multi-modal nature of RS observations. Optical, synthetic aperture radar (SAR), and multi-spectral data offer complementary insights that significantly reduce the inherent ambiguity and uncertainty in single-source analysis. To bridge this gap, we introduce RingMoE, a unified multi-modal RS foundation model with 14.7 billion parameters, pre-trained on 400 million multi-modal RS images from nine satellites. RingMoE incorporates three key innovations: (1) A hierarchical Mixture-of-Experts (MoE) architecture comprising modal-specialized, collaborative, and shared experts, effectively modeling intra-modal knowledge while capturing cross-modal dependencies to mitigate conflicts between modal representations; (2) Physics-informed self-supervised learning, explicitly embedding sensor-specific radiometric characteristics into the pre-training objectives; (3) Dynamic expert pruning, enabling adaptive model compression from 14.7B to 1B parameters while maintaining performance, facilitating efficient deployment in Earth observation applications. Evaluated across 23 benchmarks spanning six key RS tasks (i.e., classification, detection, segmentation, tracking, change detection, and depth estimation), RingMoE outperforms existing foundation models and sets new SOTAs, demonstrating remarkable adaptability from single-modal to multi-modal scenarios. Beyond theoretical progress, it has been deployed and trialed in multiple sectors, including emergency response, land management, marine sciences, and urban planning.
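The three expert tiers named in the abstract (modal-specialized, collaborative, and shared) can be pictured with a minimal routing sketch in plain Python. Everything below — the toy dimension, expert counts, random linear "experts", and top-1 gating — is an illustrative assumption, not RingMoE's actual configuration:

```python
import math
import random

random.seed(0)

DIM = 8  # toy embedding width (illustrative, not the paper's)

def make_expert(dim):
    # An "expert" here is just a random linear map standing in for an FFN block.
    w = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
    return lambda x: [sum(w[i][j] * x[j] for j in range(dim)) for i in range(dim)]

def softmax(xs):
    m = max(xs)
    e = [math.exp(v - m) for v in xs]
    s = sum(e)
    return [v / s for v in e]

class HierarchicalMoELayer:
    """Three expert tiers: the shared expert always runs, the token's
    modality tag selects a modal-specialized expert, and a learned gate
    routes to the top-1 collaborative expert for cross-modal knowledge."""

    def __init__(self, modalities, n_collab, dim=DIM):
        self.shared = make_expert(dim)
        self.modal = {m: make_expert(dim) for m in modalities}
        self.collab = [make_expert(dim) for _ in range(n_collab)]
        self.gate = [[random.uniform(-0.5, 0.5) for _ in range(dim)]
                     for _ in range(n_collab)]

    def __call__(self, x, modality):
        probs = softmax([sum(g[j] * x[j] for j in range(len(x)))
                         for g in self.gate])
        k = max(range(len(probs)), key=probs.__getitem__)  # top-1 routing
        parts = [self.shared(x),                            # shared tier
                 self.modal[modality](x),                   # modal tier
                 [probs[k] * v for v in self.collab[k](x)]] # collaborative tier
        return [sum(vals) for vals in zip(*parts)]          # sum the tiers

layer = HierarchicalMoELayer(["optical", "sar", "multispectral"], n_collab=4)
token = [random.uniform(-1.0, 1.0) for _ in range(DIM)]
out_optical = layer(token, "optical")
out_sar = layer(token, "sar")
```

Feeding the same token with different modality tags reuses the shared and collaborative paths but swaps the modal-specialized expert, which is how a design like this can keep per-modality knowledge separate while still sharing cross-modal structure.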
Problem

Research questions and friction points this paper is trying to address.

Bridging the gap in multi-modal remote sensing foundation models
Fusing diverse RS modalities that offer complementary insights
Enabling efficient deployment via dynamic expert pruning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Mixture-of-Experts architecture for multi-modal learning
Physics-informed self-supervised learning with radiometric characteristics
Dynamic expert pruning for adaptive model compression
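One way to picture the dynamic expert pruning listed above: rank experts by how often the router selects them on target-domain data and keep only the top fraction. The activation-frequency criterion, expert names, and counts below are illustrative assumptions, not the paper's exact method:

```python
def prune_experts(activation_counts, keep_ratio):
    """Keep the experts the router selects most often; drop the rest.

    activation_counts: {expert_id: times selected on target-domain data}
    keep_ratio: fraction of experts to retain after compression
    """
    n_keep = max(1, round(len(activation_counts) * keep_ratio))
    ranked = sorted(activation_counts, key=activation_counts.get, reverse=True)
    return set(ranked[:n_keep])

# Toy routing statistics for six experts on some target task:
counts = {"e0": 520, "e1": 3, "e2": 118, "e3": 61, "e4": 0, "e5": 290}
kept = prune_experts(counts, keep_ratio=1 / 3)
print(sorted(kept))  # → ['e0', 'e5']
```

The appeal of pruning by routing statistics is that rarely activated experts contribute little to the target domain, so dropping them shrinks the parameter count (14.7B to 1B in the paper) without retraining the kept experts.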
Hanbo Bi
Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China; School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100190, China; Key Laboratory of Target Cognition and Application Technology (TCAT), Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
Yingchao Feng
Aerospace Information Research Institute, Chinese Academy of Sciences
Machine learning in vision · Statistical and structural pattern recognition · Image/video analysis and understanding · Remote sensing image understanding · Machine learning and data mining with applications to remote sensing
Boyuan Tong
Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China; School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100190, China; Key Laboratory of Target Cognition and Application Technology (TCAT), Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
Mengyu Wang
Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China; School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100190, China; Key Laboratory of Target Cognition and Application Technology (TCAT), Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
Haichen Yu
Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China; School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100190, China; Key Laboratory of Target Cognition and Application Technology (TCAT), Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
Yongqiang Mao
Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
Hao Chang
Peking University
Neuroscience · Gut brain axis
W. Diao
Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China; Key Laboratory of Target Cognition and Application Technology (TCAT), Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
Peijin Wang
Aerospace Information Research Institute, Chinese Academy of Sciences
foundation model · remote sensing · deep learning
Yue Yu
Peng Cheng Laboratory, Shenzhen 518066, China
Hanyang Peng
Peng Cheng Laboratory
Deep Learning · Optimization
Yehong Zhang
Peng Cheng Laboratory, Shenzhen 518066, China
Kun Fu
Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China; School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100190, China; Key Laboratory of Target Cognition and Application Technology (TCAT), Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
Xian Sun
Aerospace Information Research Institute, Chinese Academy of Sciences
Remote Sensing · Computer Vision and Pattern Recognition · Artificial Intelligence