🤖 AI Summary
This work addresses a key limitation of existing remote sensing visual grounding methods: they are confined to a single sensor modality (either optical or synthetic aperture radar, SAR) and therefore struggle in cross-domain real-world applications. To bridge this gap, we introduce the first cross-domain remote sensing visual grounding task, establish the large-scale OptSAR-RSVG benchmark dataset, and propose OptiSAR-Net++, an efficient framework that replaces costly Transformer decoding with cross-modal matching. Our method incorporates a patch-level low-rank adaptation mixture-of-experts module (PL-MoE) to decouple cross-domain features, reformulates generative regression as a CLIP-based contrastive matching paradigm with dynamic adversarial negative sampling, and integrates a text-guided dual-gate fusion module (TGDF-SSA). Extensive experiments on both OptSAR-RSVG and DIOR-RSVG demonstrate state-of-the-art performance, with significant gains over existing approaches in both localization accuracy and computational efficiency.
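The paper's exact PL-MoE design is not specified here, but the general idea of patch-level low-rank-adaptation experts can be sketched as follows: each patch feature is routed through a softmax gate over a small set of low-rank (LoRA-style) experts, and the gated low-rank updates are added to the original features. All names, shapes, and initializations below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def lora_expert(d, r, rng):
    """One low-rank expert: delta(x) = x @ A @ B, with A of shape (d, r) and B of shape (r, d)."""
    return rng.normal(scale=0.02, size=(d, r)), rng.normal(scale=0.02, size=(r, d))

def pl_moe(patches, experts, gate_w):
    """Patch-level routing sketch: each patch token takes a soft mix of low-rank experts.

    patches : (n, d) patch features
    experts : list of (A, B) low-rank weight pairs
    gate_w  : (d, k) gating weights, k = number of experts
    """
    logits = patches @ gate_w                        # (n, k) per-patch gate logits
    gates = np.exp(logits - logits.max(axis=1, keepdims=True))
    gates /= gates.sum(axis=1, keepdims=True)        # softmax over experts
    out = patches.copy()
    for e, (A, B) in enumerate(experts):
        out += gates[:, e:e+1] * (patches @ A @ B)   # gated low-rank update per patch
    return out

d, r, k, n = 32, 4, 3, 10                            # toy sizes, chosen arbitrarily
experts = [lora_expert(d, r, rng) for _ in range(k)]
gate_w = rng.normal(scale=0.02, size=(d, k))
patches = rng.normal(size=(n, d))
fused = pl_moe(patches, experts, gate_w)
print(fused.shape)  # (10, 32)
```

The appeal of this shape of module is that each expert adds only O(d·r) parameters, so domain-specific capacity (e.g., optical vs. SAR) stays cheap relative to full dense experts.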
📄 Abstract
Remote sensing visual grounding (RSVG) aims to localize specific targets in remote sensing images using natural language expressions. However, existing methods are restricted to single-sensor domains, i.e., either optical or synthetic aperture radar (SAR), limiting their real-world applicability. In this paper, we introduce the Cross-Domain RSVG (CD-RSVG) task and construct OptSAR-RSVG, the first large-scale benchmark dataset for this setting. To tackle the challenges of cross-domain feature modeling, computational inefficiency, and fine-grained semantic discrimination, we propose OptiSAR-Net++. Our framework features a patch-level Low-Rank Adaptation Mixture of Experts (PL-MoE) for efficient cross-domain feature decoupling. To mitigate the substantial computational overhead of Transformer decoding frameworks, we adopt a CLIP-based contrastive paradigm and further incorporate dynamic adversarial negative sampling, thereby transforming generative regression into an efficient cross-modal matching process. Additionally, a text-guided dual-gate fusion module (TGDF-SSA) and a region-aware auxiliary head are introduced to enhance semantic-visual alignment and spatial modeling. Extensive experiments demonstrate that OptiSAR-Net++ achieves state-of-the-art performance on both OptSAR-RSVG and DIOR-RSVG benchmarks, offering significant advantages in localization accuracy and efficiency. Our code and dataset will be made publicly available.
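The "regression as matching" reformulation mentioned above can be illustrated with a minimal CLIP-style scorer: instead of decoding box coordinates autoregressively, embed the query text and each candidate region, then pick the region with the highest cosine similarity. The function names, the temperature value, and the toy data below are assumptions for illustration, not the paper's actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)

def l2norm(x):
    """Normalize a vector (or each row of a matrix) to unit L2 norm."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def match_region(text_emb, region_embs, temperature=0.07):
    """CLIP-style matching: score each candidate region against the query
    text by cosine similarity; return (best index, softmax matching scores)."""
    sims = l2norm(region_embs) @ l2norm(text_emb)    # (m,) cosine similarities
    scores = np.exp(sims / temperature)
    scores /= scores.sum()                           # softmax over candidates
    return int(np.argmax(sims)), scores

# Toy demo: 5 candidate region embeddings; region 2 is constructed to
# align exactly with the text embedding, so matching should select index 2.
d = 16
region_embs = rng.normal(size=(5, d))
text_emb = region_embs[2]
best, scores = match_region(text_emb, region_embs)
print(best)  # 2
```

At training time such a matcher is typically paired with a contrastive loss over positive and negative regions; the paper's dynamic adversarial negative sampling would then choose which hard negatives enter that loss, but those details are beyond this sketch.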