EarthSpatialBench: Benchmarking Spatial Reasoning Capabilities of Multimodal LLMs on Earth Imagery

📅 2026-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing geospatial reasoning benchmarks for Earth imagery, which lack support for quantitative distance and direction inference, systematic topological relations, and complex geometric primitives such as polygons and polylines. To bridge this gap, we introduce EarthSpatialBench, a comprehensive evaluation benchmark for multimodal large language models on Earth imagery that, for the first time, features coordinate-based geometric representations including bounding boxes, polylines, and polygons. The benchmark references objects through textual descriptions, visual overlays, and explicit geometric coordinates, and comprises over 325,000 question-answer pairs supporting queries at the level of individual objects, object pairs, and aggregated groups. Experiments on leading open-source and proprietary multimodal foundation models reveal critical shortcomings in current spatial understanding of Earth imagery, establishing a much-needed evaluation framework for this domain.
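
As a concrete illustration of what such a coordinate-grounded question-answer pair might look like, the sketch below shows a hypothetical record in Python; the field names, coordinates, and answer format are assumptions for exposition, not the benchmark's published schema.

```python
# Hypothetical structure of one coordinate-grounded QA pair (object-pair level,
# geometry reference mode). Field names and values are illustrative only.
qa_pair = {
    "image_id": "tile_000123",
    "query_level": "object_pair",    # single_object | object_pair | aggregate_group
    "reference_mode": "geometry",    # textual description | visual overlay | geometry
    "objects": {
        "A": {"type": "polygon",     # e.g., a building footprint, pixel coordinates
              "coords": [[12, 40], [88, 40], [88, 95], [12, 95], [12, 40]]},
        "B": {"type": "polyline",    # e.g., a road centerline
              "coords": [[0, 20], [60, 25], [128, 30]]},
    },
    "question": "What is the topological relation between A and B, and in "
                "which compass direction does B lie relative to A?",
    # Assuming the image's top edge faces north: B (y in 20-30) sits above A
    # (y in 40-95), and the two geometries do not intersect.
    "answer": {"topology": "disjoint", "direction": "north"},
}
```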

📝 Abstract
Benchmarking spatial reasoning in multimodal large language models (MLLMs) has attracted growing interest in computer vision due to its importance for embodied AI and other agentic systems that require precise interaction with the physical world. However, spatial reasoning on Earth imagery has lagged behind, as it uniquely involves grounding objects in georeferenced images and quantitatively reasoning about distances, directions, and topological relations using both visual cues and vector geometry coordinates (e.g., 2D bounding boxes, polylines, and polygons). Existing benchmarks for Earth imagery primarily focus on 2D spatial grounding, image captioning, and coarse spatial relations (e.g., simple directional or proximity cues). They lack support for quantitative direction and distance reasoning, systematic topological relations, and complex object geometries beyond bounding boxes. To fill this gap, we propose EarthSpatialBench, a comprehensive benchmark for evaluating spatial reasoning in MLLMs on Earth imagery. The benchmark contains over 325K question-answer pairs spanning: (1) qualitative and quantitative reasoning about spatial distance and direction; (2) systematic topological relations; (3) single-object queries, object-pair queries, and compositional aggregate group queries; and (4) object references expressed via textual descriptions, visual overlays, and explicit geometry coordinates, including 2D bounding boxes, polylines, and polygons. We conducted extensive experiments on both open-source and proprietary models to identify limitations in the spatial reasoning of MLLMs.
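
Deriving ground truth for such questions reduces to standard computational-geometry operations over the coordinate representations. The following is a minimal sketch, not the authors' construction pipeline, assuming pixel coordinates with the image's top edge facing north and using shapely; the geometries, the eight-way direction binning, and the predicate ordering are assumptions for this example.

```python
# Sketch: deriving distance, direction, and topological ground truth from
# vector geometries (bounding box, polyline, polygon) with shapely.
import math
from shapely.geometry import box, LineString, Polygon

building = box(12, 40, 88, 95)                      # 2D bounding box -> Polygon
road = LineString([(0, 20), (60, 25), (128, 30)])   # polyline
lake = Polygon([(100, 60), (125, 60), (125, 90), (100, 90)])  # polygon

# (1) Quantitative distance: minimum separation in pixels.
dist_px = building.distance(road)

# (2) Quantitative direction: compass bearing between centroids.
# Image y grows downward, so it is negated to make north positive.
a, b = building.centroid, road.centroid
bearing = math.degrees(math.atan2(b.x - a.x, a.y - b.y)) % 360
sectors = ["north", "northeast", "east", "southeast",
           "south", "southwest", "west", "northwest"]
direction = sectors[int((bearing + 22.5) % 360 // 45)]

# (3) Topological relation: report the first predicate that holds.
for name, pred in [("contains", building.contains),
                   ("within", building.within),
                   ("touches", building.touches),
                   ("intersects", building.intersects)]:
    if pred(lake):
        relation = name
        break
else:
    relation = "disjoint"

print(f"distance={dist_px:.1f}px, direction={direction}, relation={relation}")
```

The same primitives extend to aggregate group queries, e.g., by taking the union of several geometries before measuring distance or testing a relation.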
Problem

Research questions and friction points this paper is trying to address.

spatial reasoning
Earth imagery
multimodal LLMs
topological relations
quantitative distance
Innovation

Methods, ideas, or system contributions that make the work stand out.

spatial reasoning
Earth imagery
multimodal LLMs
topological relations
geometric grounding