L-MCAT: Unpaired Multimodal Transformer with Contrastive Attention for Label-Efficient Satellite Image Classification

📅 2025-07-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses few-label remote sensing image classification with a self-supervised cross-modal alignment method that leverages unpaired multimodal satellite data (e.g., SAR and optical). The method introduces a lightweight Transformer featuring modality-spectral adapters and an unpaired multimodal attention mechanism, enabling cross-modal feature alignment without pixel-level registration or paired annotations. Combined with contrastive attention and end-to-end self-supervised training, it supports efficient optimization on a single GPU. On the SEN12MS benchmark, the approach achieves 95.4% classification accuracy using only 20 labeled samples per class, while reducing model parameters by 47× and computational cost by 23× compared to baseline architectures. It also maintains over 92% accuracy under 50% spatial misalignment, demonstrating substantial robustness and practical applicability for real-world remote sensing scenarios.

📝 Abstract
We propose the Lightweight Multimodal Contrastive Attention Transformer (L-MCAT), a novel transformer-based framework for label-efficient remote sensing image classification using unpaired multimodal satellite data. L-MCAT introduces two core innovations: (1) Modality-Spectral Adapters (MSA) that compress high-dimensional sensor inputs into a unified embedding space, and (2) Unpaired Multimodal Attention Alignment (U-MAA), a contrastive self-supervised mechanism integrated into the attention layers to align heterogeneous modalities without pixel-level correspondence or labels. L-MCAT achieves 95.4% overall accuracy on the SEN12MS dataset using only 20 labels per class, outperforming state-of-the-art baselines while using 47x fewer parameters and 23x fewer FLOPs than MCTrans. It maintains over 92% accuracy even under 50% spatial misalignment, demonstrating robustness for real-world deployment. The model trains end-to-end in under 5 hours on a single consumer GPU.
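The abstract's key claim is that U-MAA aligns modalities without pixel-level correspondence. One way to picture this: cross-attention from one modality's tokens to the other's produces a row-stochastic attention matrix that acts as a *soft* correspondence map, so no registered pixel pairs are needed. The sketch below is illustrative only, not the paper's implementation; the single-head form, random projection weights, and dimensions are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_soft_alignment(q_tokens, k_tokens, d_k=16, seed=1):
    """Single-head cross-attention: queries from one modality attend to
    tokens of the other. Each row of the attention matrix is a soft
    correspondence distribution, so no pixel-level registration is needed.
    (Hypothetical stand-in for the paper's U-MAA mechanism; the projection
    weights here are random placeholders, not learned parameters.)"""
    rng = np.random.default_rng(seed)
    W_q = rng.normal(size=(q_tokens.shape[1], d_k)) * 0.1
    W_k = rng.normal(size=(k_tokens.shape[1], d_k)) * 0.1
    scores = (q_tokens @ W_q) @ (k_tokens @ W_k).T / np.sqrt(d_k)
    return softmax(scores, axis=1)  # row i: SAR token i's soft match over optical tokens

# Toy unpaired token sets: 6 SAR tokens vs. 9 optical tokens (counts need not match).
sar_tokens = np.random.default_rng(2).normal(size=(6, 32))
opt_tokens = np.random.default_rng(3).normal(size=(9, 32))
A = cross_modal_soft_alignment(sar_tokens, opt_tokens)
print(A.shape)  # (6, 9); each row sums to 1
```

Note that the two token sets have different sizes, which is exactly the regime where hard one-to-one pairing is impossible and a soft attention map is the natural alignment object.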
Problem

Research questions and friction points this paper is trying to address.

Label-efficient satellite image classification with unpaired multimodal data
Aligning heterogeneous modalities without pixel-level correspondence or labels
Achieving high accuracy with minimal labels and computational resources
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modality-Spectral Adapters compress sensor inputs
Unpaired Multimodal Attention Alignment aligns modalities
Lightweight transformer with contrastive self-supervised learning
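To make the two Innovation bullets concrete, the sketch below pairs a per-modality adapter (a projection of each sensor's channels into a shared embedding space, a stand-in for the MSA) with a symmetric InfoNCE-style contrastive loss. Everything here is an assumption for illustration: the embedding width, the random projection weights, the temperature, and the use of diagonal positives (the paper's unpaired setting would instead derive correspondences from attention). The SAR/optical channel counts (2 and 13) follow Sentinel-1/Sentinel-2 as used in SEN12MS.

```python
import numpy as np

def modality_adapter(x, W):
    """Project modality-specific channels into a shared d-dim embedding and
    L2-normalize. (Hypothetical stand-in for a Modality-Spectral Adapter.)"""
    z = x @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def info_nce(z_a, z_b, tau=0.07):
    """Symmetric InfoNCE-style contrastive loss between two embedding sets.
    Diagonal entries are treated as positives; in the unpaired setting such
    hard positives would be replaced by attention-derived soft matches."""
    logits = z_a @ z_b.T / tau  # (N, N) cosine similarities / temperature
    log_sm_ab = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_sm_ba = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    n = len(z_a)
    return -(np.trace(log_sm_ab) + np.trace(log_sm_ba)) / (2 * n)

rng = np.random.default_rng(0)
d = 32                             # shared embedding width (assumed)
sar = rng.normal(size=(8, 2))      # 8 patches, 2 SAR channels (VV/VH)
opt = rng.normal(size=(8, 13))     # 8 patches, 13 Sentinel-2 bands
W_sar = rng.normal(size=(2, d)) * 0.1   # placeholder adapter weights
W_opt = rng.normal(size=(13, d)) * 0.1
loss = info_nce(modality_adapter(sar, W_sar), modality_adapter(opt, W_opt))
print(float(loss) > 0)  # InfoNCE loss is positive before training
```

Minimizing such a loss end-to-end pulls matching SAR/optical embeddings together while pushing mismatched ones apart, which is the general mechanism behind the contrastive self-supervised training the card describes.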