🤖 AI Summary
Remote sensing multimodal modeling has long been split between dual-encoder retrieval models, which lack fine-grained spatial reasoning, and generative-auxiliary models, which scale poorly to retrieval. This paper introduces VLM2GeoVec, a unified single-encoder multimodal embedding model that jointly processes images, text, bounding boxes, and geographic coordinates. Its interleaved input architecture integrates cross-modal representation learning with region-level spatial reasoning in a single joint embedding space. The authors also establish RSMEB, a comprehensive remote-sensing embedding benchmark covering six fine-grained geovisual task families. Combining contrastive joint embedding, instruction tuning, and explicit geographic-coordinate encoding, VLM2GeoVec achieves P@1 scores of 26.6%, 32.5%, and 17.8% on region-caption retrieval, referring-expression retrieval, and semantic geo-localization, respectively, substantially outperforming dual-encoder baselines while matching or exceeding specialized models on conventional tasks. This work advances general-purpose multimodal understanding in remote sensing.
📝 Abstract
Satellite imagery differs fundamentally from natural images: its aerial viewpoint, very high resolution, wide scale variation, and abundance of small objects demand both region-level spatial reasoning and holistic scene understanding. Current remote-sensing approaches remain fragmented between dual-encoder retrieval models, which excel at large-scale cross-modal search but cannot interleave modalities, and generative assistants, which support region-level interpretation but lack scalable retrieval capabilities. We propose $\textbf{VLM2GeoVec}$, an instruction-following, single-encoder vision-language model trained with a contrastive loss to embed interleaved inputs (images, text, bounding boxes, and geographic coordinates) in a unified vector space, eliminating multi-stage pipelines and task-specific modules. To evaluate its versatility, we introduce $\textbf{RSMEB}$, a novel benchmark covering key remote-sensing embedding applications: scene classification; cross-modal search; compositional retrieval; visual question answering; visual grounding and region-level reasoning; and semantic geospatial retrieval. On RSMEB, VLM2GeoVec achieves $\textbf{26.6\%}$ P@1 on region-caption retrieval (+25 pp vs. dual-encoder baselines), $\textbf{32.5\%}$ P@1 on referring-expression retrieval (+19 pp), and $\textbf{17.8\%}$ P@1 on semantic geo-localization retrieval (over $3\times$ prior best), while matching or exceeding specialized baselines on conventional tasks such as scene classification and cross-modal retrieval. VLM2GeoVec unifies scalable retrieval with region-level spatial reasoning, enabling cohesive multimodal analysis in remote sensing. We will publicly release the code, checkpoints, and data upon acceptance.
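The core training objective described above, contrastive alignment of interleaved query embeddings (image + region + instruction) with target embeddings (e.g., a caption) in one vector space, can be sketched with a standard symmetric InfoNCE loss. This is a minimal NumPy illustration, not the paper's released implementation; the function name, batch layout, and temperature value are assumptions for the sketch.

```python
import numpy as np

def info_nce(q, t, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    q, t: (batch, dim) arrays; q holds embeddings of interleaved queries
    (e.g. image + bounding box + instruction), t holds target embeddings
    (e.g. region captions). Matching pairs share a row index; every other
    row in the batch serves as an in-batch negative.
    """
    # L2-normalize so the dot product is cosine similarity
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    logits = q @ t.T / temperature            # (batch, batch) similarity matrix
    labels = np.arange(len(q))                # positives lie on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(p[labels, labels]).mean()

    # average of query->target and target->query cross-entropy
    return 0.5 * (xent(logits) + xent(logits.T))
```

With perfectly aligned pairs the diagonal dominates and the loss approaches zero; mismatched pairs drive it toward log(batch size), which is what pushes the single encoder to place matching interleaved inputs and targets close together in the shared space.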