Optimal Transport Adapter Tuning for Bridging Modality Gaps in Few-Shot Remote Sensing Scene Classification

📅 2025-03-19
🤖 AI Summary
To address insufficient multimodal information utilization in few-shot remote sensing scene classification (FS-RSSC), this paper proposes an Optimal Transport (OT)-based cross-modal unified representation learning framework. Methodologically, it integrates cross-modal attention with lightweight adapter fine-tuning. Its key contributions are: (1) the first Optimal Transport Adapter (OTA), which aligns and fuses visual and textual modalities under the OT metric; and (2) an Entropy-Aware Weighted (EAW) loss, which incorporates entropy regularization to constrain similarity learning, thereby enhancing OT optimization stability and balancing cross-modal interaction. Evaluated on mainstream remote sensing benchmarks—including UC-Merced and AID—the method achieves state-of-the-art (SOTA) performance, demonstrating significant improvements in few-shot generalization capability and model robustness.

📝 Abstract
Few-Shot Remote Sensing Scene Classification (FS-RSSC) presents the challenge of classifying remote sensing images with limited labeled samples. Existing methods typically emphasize single-modal feature learning, neglecting the potential benefits of optimizing multi-modal representations. To address this limitation, we propose a novel Optimal Transport Adapter Tuning (OTAT) framework aimed at constructing an ideal Platonic representational space through optimal transport (OT) theory. This framework seeks to harmonize rich visual information with sparser textual cues, enabling effective cross-modal information transfer and complementarity. Central to this approach is the Optimal Transport Adapter (OTA), which employs a cross-modal attention mechanism to enrich textual representations and facilitate better subsequent information interaction. By transforming the network optimization into an OT optimization problem, OTA establishes efficient pathways for balanced information exchange between modalities. Moreover, we introduce a sample-level Entropy-Aware Weighted (EAW) loss, which combines difficulty-weighted similarity scores with entropy-based regularization. This loss function provides finer control over the OT optimization process, enhancing its solvability and stability. Our framework offers a scalable and efficient solution for advancing multimodal learning in remote sensing applications. Extensive experiments on benchmark datasets demonstrate that OTAT achieves state-of-the-art performance in FS-RSSC, significantly improving model performance and generalization.
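The abstract describes casting cross-modal alignment as an entropy-regularized OT problem. The paper's exact formulation is not reproduced here; the following is a minimal sketch of the standard Sinkhorn iteration for computing a transport plan between visual and textual feature sets under a cosine-distance cost (all variable names, the uniform marginals, and the toy features are illustrative assumptions, not the authors' method):

```python
import numpy as np

def sinkhorn(cost, reg=0.5, n_iters=200):
    """Entropy-regularized OT plan between two uniform marginals (Sinkhorn)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)   # source marginal (e.g. visual tokens)
    b = np.full(m, 1.0 / m)   # target marginal (e.g. textual tokens)
    K = np.exp(-cost / reg)   # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):  # alternate scaling to satisfy both marginals
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]  # transport plan P

# Toy example: align 4 "visual" features with 3 "textual" features.
rng = np.random.default_rng(0)
vis = rng.normal(size=(4, 8))
txt = rng.normal(size=(3, 8))
vis_n = vis / np.linalg.norm(vis, axis=1, keepdims=True)
txt_n = txt / np.linalg.norm(txt, axis=1, keepdims=True)
cost = 1.0 - vis_n @ txt_n.T  # cosine distance as transport cost
P = sinkhorn(cost)
```

The resulting plan `P` is nonnegative with rows summing to the visual marginal and columns to the textual one, so it can serve as a soft correspondence for balanced information exchange between the two modalities.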
Problem

Research questions and friction points this paper is trying to address.

Classify remote sensing images with limited labeled samples
Optimize multi-modal representations for better information transfer
Enhance model performance and generalization in few-shot learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimal Transport Adapter harmonizes visual-textual data
Entropy-Aware Weighted loss enhances optimization stability
Cross-modal attention enriches textual representations effectively
Zhong Ji
Tianjin University
Multimedia Understanding · Cross-modal Learning · Zero/Few-shot Learning

Ci Liu
School of Electrical and Information Engineering, Tianjin Key Laboratory of Brain-inspired Intelligence Technology, Tianjin University, Tianjin 300072, China
Jingren Liu
PhD student, Tianjin University
Continual Learning · Long-form Video Understanding · Unified Models

Chen Tang
School of Electrical and Information Engineering, Tianjin Key Laboratory of Brain-inspired Intelligence Technology, Tianjin University, Tianjin 300072, China
Yanwei Pang
Tianjin University
Computer Vision · Image Processing · Pattern Recognition · Machine Learning

Xuelong Li
Institute of Artificial Intelligence (TeleAI), China Telecom Corp Ltd, Beijing 100033, China