🤖 AI Summary
Locating mid-scale semantic entities (e.g., residential blocks, farmland clusters, industrial zones) in remote sensing imagery remains challenging because of their contextual dependence and the lack of precise pixel-level supervision. Method: This paper proposes a novel cross-modal task, conteXtual referring Map (XeMap), for text-driven, context-aware, pixel-level referring localization. We design XeMap-Network, which fuses text and image embeddings through self- and cross-modal attention and aligns multi-scale visual features with text semantics via a Hierarchical Multi-Scale Semantic Alignment (HMSA) module, trained via zero-shot cross-modal alignment to remove the reliance on pixel-level annotations. Contribution/Results: Evaluated on our newly constructed XeMap-set benchmark under zero-shot settings, our approach significantly outperforms existing state-of-the-art methods. It achieves the first text-to-pixel fine-grained contextual mapping for remote sensing scenes, establishing a new paradigm for large-scale Earth-surface semantic understanding.
📝 Abstract
Advancements in remote sensing (RS) imagery have provided high-resolution detail and vast coverage, yet existing methods, such as image-level captioning/retrieval and object-level detection/segmentation, often fail to capture the mid-scale semantic entities essential for interpreting large-scale scenes. To address this, we propose the conteXtual referring Map (XeMap) task, which focuses on contextual, fine-grained localization of text-referred regions in large-scale RS scenes. Unlike traditional approaches, XeMap enables precise mapping of mid-scale semantic entities that are often overlooked by image-level or object-level methods. To achieve this, we introduce XeMap-Network, a novel architecture designed to handle the complexities of pixel-level cross-modal contextual referring mapping in RS. The network includes a fusion layer that applies self- and cross-attention mechanisms to strengthen the interaction between text and image embeddings. Furthermore, we propose a Hierarchical Multi-Scale Semantic Alignment (HMSA) module that aligns multi-scale visual features with the text semantic vector, enabling precise multimodal matching across large-scale RS imagery. To support the XeMap task, we provide XeMap-set, a novel annotated dataset tailored to this task that fills the absence of XeMap datasets in RS imagery. XeMap-Network is evaluated in a zero-shot setting against state-of-the-art methods and demonstrates superior performance, highlighting its effectiveness in accurately mapping referred regions and providing valuable insights for interpreting large-scale RS environments.
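The fusion-then-matching idea described in the abstract, self- and cross-attention between text and image embeddings followed by text-to-pixel alignment, can be sketched roughly as follows. This is a minimal PyTorch sketch under stated assumptions: the class names, dimensions, residual/norm layout, and the cosine-similarity matching head are illustrative choices, not the paper's actual XeMap-Network implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionLayer(nn.Module):
    """Illustrative fusion layer (not the paper's exact design):
    self-attention over visual tokens, then cross-attention from
    visual tokens (queries) to text tokens (keys/values)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, vis, txt):
        # vis: (B, N_pixels, dim) flattened image features
        # txt: (B, N_words,  dim) text token embeddings
        h, _ = self.self_attn(vis, vis, vis)
        vis = self.norm1(vis + h)              # residual + norm
        h, _ = self.cross_attn(vis, txt, txt)  # text conditions the pixels
        return self.norm2(vis + h)

def referring_map(vis, txt_vec):
    """Cosine similarity between each pixel feature and a pooled text
    vector, giving a text-to-pixel relevance map with values in [-1, 1]."""
    vis = F.normalize(vis, dim=-1)       # (B, N_pixels, dim)
    txt_vec = F.normalize(txt_vec, dim=-1)  # (B, dim)
    return torch.einsum("bnd,bd->bn", vis, txt_vec)

fusion = FusionLayer()
vis = torch.randn(2, 64, 256)   # e.g. an 8x8 feature map, flattened
txt = torch.randn(2, 12, 256)   # 12 text token embeddings
fused = fusion(vis, txt)
heat = referring_map(fused, txt.mean(dim=1))  # mean-pooled text vector
print(fused.shape, heat.shape)  # torch.Size([2, 64, 256]) torch.Size([2, 64])
```

In the same spirit, HMSA would repeat this matching at several feature-map resolutions and combine the resulting maps; the single-scale head above only illustrates the core text-to-pixel scoring step.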