M$^3$amba: CLIP-driven Mamba Model for Multi-modal Remote Sensing Classification

📅 2025-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the insufficient fusion and computational inefficiency that arise from cross-modal semantic inconsistency in multimodal remote sensing image classification, this paper proposes an end-to-end CLIP-driven Mamba fusion framework. The authors design CLIP-guided modality-specific adapters to achieve lightweight fine-tuning and cross-modal semantic alignment. They further construct a linear-complexity multimodal Mamba backbone and introduce a Cross-SS2D cross-attention module to enhance dynamic inter-modal interaction. Evaluated on hyperspectral remote sensing classification tasks, the method improves average accuracy by at least 5.98% over state-of-the-art baselines while significantly accelerating training, balancing accuracy and efficiency. The source code is publicly available.

📝 Abstract
Multi-modal fusion holds great promise for integrating information from different modalities. However, because modal consistency is often overlooked, existing multi-modal fusion methods in remote sensing still suffer from incomplete semantic information and low computational efficiency in their fusion designs. Inspired by the observation that the vision-language pre-training model CLIP can effectively extract strong semantic information from visual features, we propose M$^3$amba, a novel end-to-end CLIP-driven Mamba model for multi-modal fusion that addresses these challenges. Specifically, we introduce CLIP-driven modality-specific adapters into the fusion architecture to avoid the domain-understanding bias caused by direct inference, endowing the original CLIP encoder with modality-specific perception. This unified framework achieves a comprehensive semantic understanding of different modalities with minimal training, thereby guiding cross-modal feature fusion. To further strengthen the consistent association between modality mappings, we design a multi-modal Mamba fusion architecture with linear complexity and a cross-attention module, Cross-SS2D, which together enable effective and efficient information interaction for complete fusion. Extensive experiments show that M$^3$amba improves average performance by at least 5.98% over state-of-the-art methods on multi-modal hyperspectral image classification tasks in remote sensing, while also demonstrating excellent training efficiency, improving both accuracy and efficiency. The code is released at https://github.com/kaka-Cao/M3amba.
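The "modality-specific adapter" idea in the abstract, lightweight trainable layers wrapped around a frozen pre-trained encoder, is commonly realized as a bottleneck module with a residual connection. The sketch below is a minimal pure-Python illustration of that generic pattern; the function and weight names are hypothetical and are not taken from the paper's released code.

```python
def adapter_forward(x, w_down, w_up):
    """Generic bottleneck adapter sketch (hypothetical, not the paper's code):
    down-project a frozen encoder's feature vector x, apply ReLU, up-project,
    then add a residual connection so the adapter starts near the identity.

    x:      feature vector of dimension d (list of floats)
    w_down: d x r down-projection matrix (r << d is the bottleneck size)
    w_up:   r x d up-projection matrix
    """
    d, r = len(x), len(w_down[0])
    # Down-projection followed by ReLU: h lives in the small bottleneck space.
    h = [max(0.0, sum(x[i] * w_down[i][j] for i in range(d))) for j in range(r)]
    # Up-projection back to dimension d, plus the residual input.
    return [x[k] + sum(h[j] * w_up[j][k] for j in range(r)) for k in range(d)]
```

Only the small `w_down`/`w_up` matrices would be trained, which matches the abstract's claim of achieving semantic alignment with minimal training while the CLIP encoder itself stays frozen.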
Problem

Research questions and friction points this paper is trying to address.

Enhances multi-modal fusion for remote sensing classification
Improves semantic consistency and computational efficiency
Integrates CLIP-driven adapters and Mamba architecture
Innovation

Methods, ideas, or system contributions that make the work stand out.

CLIP-driven modality-specific adapters enhance semantic understanding
Multi-modal Mamba fusion with linear complexity
Cross-SS2D module enables efficient cross-modal interaction
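Cross-SS2D itself is the paper's own module built on Mamba's 2D selective scan. As background for the "cross-modal interaction" bullet above, plain single-head cross-attention, where queries from one modality attend to keys/values from the other, can be sketched as follows. This is illustrative dot-product attention only, not the selective-scan mechanism the paper actually uses.

```python
import math

def cross_attention(q_feats, kv_feats):
    """Single-head cross-attention sketch: each query vector from modality A
    attends over modality B's feature vectors, which serve directly as both
    keys and values (no learned projections, for illustration only)."""
    d = len(q_feats[0])
    scale = 1.0 / math.sqrt(d)  # standard scaled dot-product temperature
    out = []
    for q in q_feats:
        # similarity of this query to every key from the other modality
        scores = [sum(qi * ki for qi, ki in zip(q, k)) * scale for k in kv_feats]
        # numerically stable softmax over the scores
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # output is the attention-weighted average of modality B's features
        out.append([sum(w * k[j] for w, k in zip(weights, kv_feats))
                    for j in range(d)])
    return out
```

A dense attention map like this costs O(N^2) in sequence length; the linear-complexity claim in the bullets comes from replacing it with Mamba-style selective scans.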
Mingxiang Cao
State Key Laboratory of Integrated Services Networks, Xidian University, Xi'an 710071, China

Weiying Xie
Xidian University
remote image processing, deep learning, target detection, anomaly detection

Xin Zhang
State Key Laboratory of Integrated Services Networks, Xidian University, Xi'an 710071, China

Jiaqing Zhang
University of Science and Technology of China
Recommender System, Data-Centric AI

Kai Jiang
School of Mathematics and Computational Science
Quasiperiodic Systems, Applied Mathematics & Computational Mathematics

Jie Lei
Universitat Politècnica de València
Computer Engineering, Electronic Engineering

Yunsong Li
State Key Laboratory of Integrated Services Networks, Xidian University, Xi'an 710071, China