🤖 AI Summary
This work addresses the limited fine-grained spatial understanding of existing multimodal large language models (MLLMs) in remote sensing, which stems from small-scale datasets with coarse semantics and a lack of authoritative cadastral grounding. To overcome this, we present a large-scale multimodal dataset that aligns high-precision cadastral vector data with high-resolution remote sensing imagery, comprising 510,000 images and 3.8 million finely annotated objects across 135 semantic categories. Leveraging this dataset, we design an instruction-tuning benchmark of seven spatial reasoning tasks and establish a baseline with a standard LLaVA architecture. Experimental results show that specialized remote sensing models and commercial models such as Gemini struggle on these tasks in zero-shot settings, whereas the LLaVA baseline trained on this dataset closes the gap without architectural modifications, underscoring the critical role of high-quality supervised data in enhancing the fine-grained spatial localization abilities of general-purpose MLLMs.
📝 Abstract
Precise spatial understanding in Earth Observation is essential for translating raw aerial imagery into actionable insights for applications such as urban planning, environmental monitoring, and disaster management. However, Multimodal Large Language Models (MLLMs) exhibit critical deficiencies in fine-grained spatial understanding within Remote Sensing (RS), primarily due to a reliance on limited or repurposed legacy datasets. To bridge this gap, we introduce a large-scale dataset grounded in verifiable cadastral vector data, comprising 3.8 million annotated objects across 510k high-resolution images with 135 granular semantic categories. We validate this resource through a comprehensive instruction-tuning benchmark spanning seven spatial reasoning tasks. Our evaluation establishes a robust baseline using a standard LLaVA architecture. We show that while current RS-specialized and commercial models (e.g., Gemini) struggle in zero-shot settings, high-fidelity supervision overcomes these deficiencies, enabling standard architectures to master fine-grained spatial grounding without complex architectural modifications.