🤖 AI Summary
This work addresses cross-modal fine-grained object retrieval between optical and synthetic aperture radar (SAR) images under unaligned conditions, where large modality discrepancies, strong speckle noise, and structural inconsistencies hinder performance. To tackle these issues, the authors propose GeoMamba, a framework built upon the MambaVision architecture that introduces two key components: a Geometric Feature Injection (GFI) module and a Geometric Consistency Constraint (GCC) module. These modules use classical geometric operators and deep supervision to enable structure-aware cross-modal feature interaction. The study also presents FGOS-as, the first dataset specifically designed for fine-grained cross-modal retrieval in unaligned optical–SAR scenarios. On this dataset, GeoMamba achieves 63.3% mAP and 77.0% Rank-1 accuracy in the all-to-all retrieval setting, outperforming existing methods.
📝 Abstract
Multi-source remote sensing enables complementary observation of ground objects, but cross-modal fine-grained object retrieval remains challenging, especially under unaligned optical and SAR conditions. Unlike conventional retrieval settings that rely on paired or spatially aligned samples, practical optical-SAR retrieval is affected by substantial modality discrepancy, speckle noise, and structural inconsistency, which limit robust cross-modal representation learning. To address this problem, we propose GeoMamba, a geometry-driven framework tailored for optical-SAR fine-grained retrieval. Specifically, GeoMamba introduces a Geometric Feature Injection (GFI) module that enhances cross-modal feature interaction and incorporates structural priors, thereby improving the robustness of SAR representations and promoting geometry-consistent feature learning. In addition, a Geometric Consistency Constraint (GCC) module, together with a Deep Supervision (DS) strategy, imposes hierarchical geometric constraints using classical operators, which helps preserve informative object structures during representation learning. We further construct a new dataset, FGOS-as, containing 11 aerospace and maritime categories for evaluating unaligned cross-modal fine-grained object retrieval in realistic remote sensing scenarios. Extensive experiments on FGOS-as demonstrate that GeoMamba outperforms existing methods, achieving 63.3% mAP and 77.0% Rank-1 accuracy in the all-to-all retrieval setting.
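For readers unfamiliar with the two reported metrics, the sketch below shows how mAP and Rank-1 accuracy are commonly computed for retrieval. It is an illustration only, not the paper's evaluation code: the function name `retrieval_metrics`, the use of cosine similarity, and the simple query/gallery split are assumptions, and the all-to-all protocol used on FGOS-as may differ in detail.

```python
import numpy as np

def retrieval_metrics(query_feats, query_labels, gallery_feats, gallery_labels):
    """Return (mAP, Rank-1) for cosine-similarity retrieval (hypothetical helper)."""
    # L2-normalise so the dot product equals cosine similarity.
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = q @ g.T

    aps, rank1_hits = [], []
    for i in range(len(q)):
        order = np.argsort(-sims[i])  # gallery indices, most similar first
        matches = (gallery_labels[order] == query_labels[i]).astype(np.float64)
        if matches.sum() == 0:
            continue  # skip queries with no relevant gallery item
        rank1_hits.append(matches[0])  # 1.0 if the top result is correct
        # Precision at each rank, averaged over the ranks holding a correct item.
        precision_at_k = np.cumsum(matches) / (np.arange(len(matches)) + 1)
        aps.append((precision_at_k * matches).sum() / matches.sum())
    return float(np.mean(aps)), float(np.mean(rank1_hits))
```

Rank-1 counts only whether the single best match shares the query's category, while mAP rewards ranking all same-category gallery items near the top, which is why the two numbers can differ substantially.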