M-SpecGene: Generalized Foundation Model for RGBT Multispectral Vision

📅 2025-07-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

RGB-Thermal (RGBT) multispectral vision has largely been studied case by case, which introduces human-designed inductive bias, modality bias, and data scarcity. To address this, the paper presents M-SpecGene, which the authors describe as an initial attempt at a generalized RGBT multispectral foundation model. It introduces the Cross-Modality Structural Sparsity (CMSS) metric to quantify information density across the two modalities, and combines it with a Gaussian Mixture Model-guided progressive masking strategy (GMM-CMSS) for flexible, easy-to-hard, object-centric self-supervised pre-training that accounts for the information imbalance in RGBT data. The model aims to learn modality-invariant representations and to fold previously separate, task-specific studies into a unified paradigm. Experiments on eleven datasets across four RGBT downstream tasks are reported to validate its generalizability.

📝 Abstract

RGB-Thermal (RGBT) multispectral vision is essential for robust perception in complex environments. Most RGBT tasks follow a case-by-case research paradigm, relying on manually customized models to learn task-oriented representations. Nevertheless, this paradigm is inherently constrained by artificial inductive bias, modality bias, and data bottlenecks. To address these limitations, we make an initial attempt to build a Generalized RGBT MultiSpectral foundation model (M-SpecGene), which aims to learn modality-invariant representations from large-scale broad data in a self-supervised manner. M-SpecGene provides new insights into multispectral fusion and integrates prior case-by-case studies into a unified paradigm. Considering the unique characteristic of information imbalance in RGBT data, we introduce the Cross-Modality Structural Sparsity (CMSS) metric to quantify the information density across two modalities. We then develop the GMM-CMSS progressive masking strategy to facilitate a flexible, easy-to-hard, and object-centric pre-training process. Comprehensive experiments validate M-SpecGene's generalizability across eleven datasets for four RGBT downstream tasks. The code will be available at https://github.com/CalayZhou/M-SpecGene.

Problem

Research questions and friction points this paper is trying to address.

Overcoming the limitations of manually customized, task-specific models in RGBT tasks

Learning modality-invariant representations from broad data

Addressing information imbalance in RGBT multispectral fusion

Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised learning for modality-invariant representations

Cross-Modality Structural Sparsity (CMSS) metric

GMM-CMSS progressive masking strategy
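To make the idea of score-guided, easy-to-hard masking more concrete, here is a minimal sketch. It is not the paper's implementation: the page does not give the CMSS formula or the exact GMM-CMSS schedule, so `patch_scores` uses a hypothetical stand-in (per-patch gradient energy summed over both modalities), `fit_gmm_1d` is a toy EM fit, and `progressive_mask` assumes that low-score patches are the "easy" ones. All function names and parameters are illustrative.

```python
import numpy as np

def patch_scores(rgb_gray, thermal, patch=16):
    """Hypothetical stand-in score, NOT the paper's CMSS definition:
    per-patch mean gradient magnitude, summed over the two modalities."""
    def grad_energy(img):
        gy, gx = np.gradient(img.astype(np.float64))
        mag = np.hypot(gx, gy)
        ph, pw = mag.shape[0] // patch, mag.shape[1] // patch
        mag = mag[:ph * patch, :pw * patch]
        return mag.reshape(ph, patch, pw, patch).mean(axis=(1, 3))
    return (grad_energy(rgb_gray) + grad_energy(thermal)).ravel()

def fit_gmm_1d(x, k=2, iters=50):
    """Toy EM fit of a k-component 1D Gaussian mixture (illustrative only)."""
    mu = np.quantile(x, np.linspace(0.1, 0.9, k))
    var = np.full(k, x.var() + 1e-6)
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibilities, computed in log space for stability
        logp = -0.5 * ((x[:, None] - mu) ** 2 / var + np.log(2 * np.pi * var))
        logp += np.log(pi)
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update weights, means, and variances
        nk = r.sum(axis=0) + 1e-12
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        pi = nk / nk.sum()
    return pi, mu, var

def progressive_mask(scores, progress, mask_ratio=0.75, rng=None):
    """Sample a boolean patch mask. `progress` in [0, 1] shifts the sampling
    focus from the lowest-mean mixture component (assumed easy) towards the
    highest-mean one (assumed hard). The schedule itself is an assumption."""
    rng = np.random.default_rng() if rng is None else rng
    _, mu, var = fit_gmm_1d(scores)
    centre = (1.0 - progress) * mu.min() + progress * mu.max()
    width = np.sqrt(var.mean())
    # Gaussian weighting around the current focus; epsilon keeps every p > 0
    w = np.exp(-0.5 * ((scores - centre) / width) ** 2) + 1e-8
    n_mask = int(round(mask_ratio * scores.size))
    idx = rng.choice(scores.size, size=n_mask, replace=False, p=w / w.sum())
    mask = np.zeros(scores.size, dtype=bool)
    mask[idx] = True
    return mask
```

In a pre-training loop one might call `progressive_mask(scores, progress=step / total_steps)` for each RGBT pair, so that the masked patches drift from one mixture component towards the other as training proceeds. The authors' repository should be treated as the reference for the actual metric and schedule.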
