MMLANDMARKS: a Cross-View Instance-Level Benchmark for Geo-Spatial Understanding

📅 2025-12-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing geospatial understanding benchmarks suffer from incomplete multimodal coverage and lack unified alignment across imagery (aerial/ground), textual descriptions, and GPS coordinates, hindering the development of cross-modal geospatial intelligence. To address this, we introduce MMLANDMARKS, a quadruple-modality-aligned geospatial benchmark comprising 18,557 U.S. landmarks with fine-grained, instance-level correspondence across satellite imagery, ground-level photos, descriptive text, and precise GPS coordinates. We propose a CLIP-inspired multimodal alignment training framework that jointly models cross-view visual features, geographic coordinates, and textual semantics. Our approach achieves significant improvements over unimodal and bimodal baselines on ground-to-satellite image retrieval and text-to-GPS localization. Critically, MMLANDMARKS enables, for the first time, systematic joint vision-language-geolocation tasks and cross-perspective semantic retrieval, thereby filling a critical gap in unified multimodal geospatial benchmarking.

📝 Abstract
Geo-spatial analysis of our world benefits from a multimodal approach, as every single geographic location can be described in numerous ways (images from various viewpoints, textual descriptions, and geographic coordinates). Current geo-spatial benchmarks have limited coverage across modalities, considerably restricting progress in the field, as existing approaches cannot integrate all relevant modalities within a unified framework. We introduce the Multi-Modal Landmark dataset (MMLANDMARKS), a benchmark composed of four modalities: 197k high-resolution aerial images, 329k ground-view images, textual information, and geographic coordinates for 18,557 distinct landmarks in the United States. The MMLANDMARKS dataset has a one-to-one correspondence across every modality, which enables training and benchmarking models for various geo-spatial tasks, including cross-view Ground-to-Satellite retrieval, ground and satellite geolocalization, Text-to-Image, and Text-to-GPS retrieval. We demonstrate broad generalization and competitive performance against off-the-shelf foundational models and specialized state-of-the-art models across different tasks by employing a simple CLIP-inspired baseline, illustrating the necessity for multimodal datasets to achieve broad geo-spatial understanding.
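The "CLIP-inspired baseline" described above aligns four modalities in a shared embedding space. The paper does not spell out the exact objective here, but a common way to extend CLIP's symmetric contrastive loss to more than two modalities is to sum it over every modality pair. The sketch below illustrates that idea under those assumptions; the encoder outputs, modality names, and temperature value are illustrative, not the paper's actual configuration.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of paired embeddings.

    Row i of emb_a is assumed to match row i of emb_b (the dataset's
    one-to-one correspondence), so matching pairs lie on the diagonal
    of the similarity matrix.
    """
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.t() / temperature               # (B, B) cosine similarities
    targets = torch.arange(a.size(0))              # diagonal = positive pairs
    loss_a = F.cross_entropy(logits, targets)      # a -> b retrieval
    loss_b = F.cross_entropy(logits.t(), targets)  # b -> a retrieval
    return 0.5 * (loss_a + loss_b)

def multimodal_alignment_loss(embeddings, temperature=0.07):
    """Sum the pairwise CLIP loss over all pairs of modalities.

    `embeddings` maps a modality name to a (B, d) tensor, e.g. keys
    'aerial', 'ground', 'text', 'gps' (illustrative names only).
    """
    names = list(embeddings)
    total = torch.zeros(())
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            total = total + clip_contrastive_loss(
                embeddings[names[i]], embeddings[names[j]], temperature)
    return total
```

With four modalities this yields six pairwise terms per batch, which is what lets a single model serve Ground-to-Satellite, Text-to-Image, and Text-to-GPS retrieval at once.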
Problem

Research questions and friction points this paper is trying to address.

Addresses limited multimodal coverage in geo-spatial benchmarks
Introduces a unified dataset with four modalities for landmarks
Enables training models for cross-view retrieval and geolocalization tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal dataset with four aligned modalities
CLIP-inspired baseline for cross-view geo-spatial tasks
One-to-one correspondence enabling unified multimodal training
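Treating GPS coordinates as a full modality means embedding a raw (latitude, longitude) pair into the same space as the image and text encoders. One standard recipe, borrowed from positional encodings, is to lift the two scalars into sinusoidal features at multiple frequencies and pass them through a small MLP. The encoder below is a hypothetical sketch of that approach; the paper's actual coordinate encoder may differ, and all dimensions and names here are assumptions.

```python
import math
import torch
import torch.nn as nn

class GPSEncoder(nn.Module):
    """Illustrative GPS encoder: sinusoidal lat/lon features + a small MLP.

    Hypothetical design for exposition; not the paper's actual architecture.
    """
    def __init__(self, num_freqs=8, dim=128):
        super().__init__()
        self.num_freqs = num_freqs
        in_dim = 2 * 2 * num_freqs  # (lat, lon) x (sin, cos) x num_freqs
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, coords):
        # coords: (B, 2) with latitude in [-90, 90], longitude in [-180, 180]
        lat = coords[:, 0:1] / 90.0    # scale both axes to roughly [-1, 1]
        lon = coords[:, 1:2] / 180.0
        x = torch.cat([lat, lon], dim=-1)
        feats = []
        for k in range(self.num_freqs):
            freq = (2.0 ** k) * math.pi  # geometrically spaced frequencies
            feats.append(torch.sin(freq * x))
            feats.append(torch.cos(freq * x))
        return self.mlp(torch.cat(feats, dim=-1))  # (B, dim) embedding
```

The multi-frequency features let nearby coordinates map to nearby embeddings while still distinguishing locations at fine spatial scales, which retrieval tasks like Text-to-GPS require.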
Oskar Kristoffersen
Technical University of Denmark, Pioneer Center for AI
Alba R. Sánchez
Technical University of Denmark, Pioneer Center for AI
Morten R. Hannemose
Technical University of Denmark, Pioneer Center for AI
Anders B. Dahl
Technical University of Denmark, Pioneer Center for AI
Dim P. Papadopoulos
Associate Professor, Technical University of Denmark
Computer Vision · Machine Learning