XeMap: Contextual Referring in Large-Scale Remote Sensing Environments

📅 2025-04-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Locating mid-scale semantic entities (e.g., residential blocks, farmland clusters, industrial zones) in remote sensing imagery remains challenging due to their contextual dependence and lack of precise pixel-level supervision. Method: This paper proposes a novel cross-modal task—Contextual Reference Mapping (XeMap)—for text-driven, context-aware pixel-level referring localization. We design XeMap-Net, featuring a Hierarchical Multi-Scale Semantic Alignment (HMSA) module that jointly leverages self-attention and cross-modal attention, trained via zero-shot cross-modal alignment to eliminate reliance on pixel-level annotations. Contribution/Results: Evaluated on our newly constructed XeMap-Set benchmark under zero-shot settings, our approach significantly outperforms existing state-of-the-art methods. It achieves the first text-to-pixel fine-grained contextual mapping for remote sensing scenes, establishing a new paradigm for large-scale Earth surface semantic understanding.

📝 Abstract
Advancements in remote sensing (RS) imagery have provided high-resolution detail and vast coverage, yet existing methods, such as image-level captioning/retrieval and object-level detection/segmentation, often fail to capture the mid-scale semantic entities essential for interpreting large-scale scenes. To address this, we propose the conteXtual referring Map (XeMap) task, which focuses on contextual, fine-grained localization of text-referred regions in large-scale RS scenes. Unlike traditional approaches, XeMap enables precise mapping of mid-scale semantic entities that are often overlooked by image-level or object-level methods. To achieve this, we introduce XeMap-Network, a novel architecture designed to handle the complexities of pixel-level cross-modal contextual referring mapping in RS. The network includes a fusion layer that applies self- and cross-attention mechanisms to enhance the interaction between text and image embeddings. Furthermore, we propose a Hierarchical Multi-Scale Semantic Alignment (HMSA) module that aligns multiscale visual features with the text semantic vector, enabling precise multimodal matching across large-scale RS imagery. To support the XeMap task, we provide a novel annotated dataset, XeMap-set, specifically tailored for this task, overcoming the lack of XeMap datasets in RS imagery. XeMap-Network is evaluated in a zero-shot setting against state-of-the-art methods, demonstrating superior performance. This highlights its effectiveness in accurately mapping referred regions and providing valuable insights for interpreting large-scale RS environments.
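The abstract's pipeline (text-conditioned cross-attention over visual tokens, then multi-scale text-to-feature alignment) can be sketched in miniature. This is a minimal illustration, not the paper's implementation: the function names (`cross_attend`, `hmsa_score_maps`), the single-head attention, and the cosine-similarity scoring with nearest-neighbour upsampling are all assumptions standing in for the actual XeMap-Network and HMSA designs.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(text_vec, patches):
    # patches: (N, d) visual tokens; text_vec: (d,) sentence embedding.
    # Single-head cross-attention: the text query attends over visual tokens.
    scores = softmax(patches @ text_vec / np.sqrt(patches.shape[1]))
    return scores @ patches  # (d,) text-conditioned visual summary

def hmsa_score_maps(text_vec, feature_pyramid):
    # Hedged stand-in for the HMSA idea: cosine similarity between the
    # text vector and visual features at each scale, upsampled to the
    # finest resolution by nearest-neighbour repetition, then averaged.
    target = max(f.shape[0] for f in feature_pyramid)
    maps = []
    for feat in feature_pyramid:  # feat: (H, W, d)
        h = feat.shape[0]
        sim = (feat @ text_vec) / (
            np.linalg.norm(feat, axis=-1) * np.linalg.norm(text_vec) + 1e-8)
        maps.append(np.repeat(np.repeat(sim, target // h, 0), target // h, 1))
    return np.mean(maps, axis=0)  # (target, target) text-relevance map

rng = np.random.default_rng(0)
d = 32
text = rng.standard_normal(d)                      # mock text embedding
pyramid = [rng.standard_normal((8, 8, d)),         # coarse visual features
           rng.standard_normal((16, 16, d))]       # fine visual features
fused = cross_attend(text, pyramid[1].reshape(-1, d))
relevance = hmsa_score_maps(text, pyramid)
print(relevance.shape)
```

Thresholding such a relevance map would give a rough pixel-level referring mask; the paper's zero-shot setting implies the alignment is learned without pixel-level masks, which this sketch does not model.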
Problem

Research questions and friction points this paper is trying to address.

Addressing mid-scale semantic entity localization in remote sensing
Enabling precise text-referred region mapping in large-scale scenes
Overcoming lack of datasets for contextual referring in RS imagery
Innovation

Methods, ideas, or system contributions that make the work stand out.

XeMap-Network for pixel-level cross-modal contextual referring
Hierarchical Multi-Scale Semantic Alignment module for precise matching
XeMap-set dataset tailored for contextual referring in RS
Yuxi Li
Unknown affiliation
machine learning, computer vision
Lu Si
Qiyuan Lab, Beijing 100095, China
Yujie Hou
Qiyuan Lab, Beijing 100095, China
Chenguang Liu
Qiyuan Lab, Beijing 100095, China
Bin Li
Qiyuan Lab, Beijing 100095, China
Hongjian Fang
Qiyuan Lab, Beijing 100095, China
Jun Zhang
Qiyuan Lab, Beijing 100095, China