Geo²: Geometry-Guided Cross-view Geo-Localization and Image Synthesis

📅 2026-03-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of establishing geometric correspondences between ground-level and aerial views—a key difficulty in cross-view geo-localization and image synthesis—by proposing a unified framework named Geo². Geo² leverages a geometric foundation model (e.g., VGGT) to extract 3D geometric priors and introduces GeoMap to embed multi-view features into a shared 3D-aware latent space. The framework further incorporates GeoFlow, a flow-matching module, and a bidirectional consistency loss, enabling what the authors describe as the first joint optimization of geo-localization and bidirectional cross-view image synthesis. Evaluated on the CVUSA, CVACT, and VIGOR benchmarks, Geo² achieves state-of-the-art performance on both tasks simultaneously, demonstrating its effectiveness in bridging the domain gap between disparate viewpoints through explicit geometric reasoning.
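Once ground and aerial features live in a shared 3D-aware latent space, geo-localization reduces to nearest-neighbor retrieval: rank aerial reference tiles by similarity to a ground-level query embedding. The following is a minimal sketch of that retrieval step with NumPy; the random toy latents stand in for what GeoMap would produce, and the cosine-similarity ranking is a common CVGL convention, not necessarily the paper's exact matching function.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM, N_REF = 8, 5  # toy latent dimensionality and number of aerial tiles

# Toy shared latents: in Geo², GeoMap would embed aerial and ground
# features into this space; here they are random vectors for illustration.
aerial_latents = rng.normal(size=(N_REF, DIM))
# A ground query that is a slightly perturbed copy of aerial tile 2.
ground_query = aerial_latents[2] + 0.01 * rng.normal(size=DIM)

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products become cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# CVGL as retrieval: rank aerial tiles by cosine similarity to the query.
sims = l2_normalize(aerial_latents) @ l2_normalize(ground_query)
best = int(np.argmax(sims))
```

In practice the reference latents would be precomputed for a whole aerial tile database, and the top-ranked tile gives the query's geographic location.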
📝 Abstract
Cross-view geo-spatial learning consists of two important tasks: Cross-View Geo-Localization (CVGL) and Cross-View Image Synthesis (CVIS), both of which rely on establishing geometric correspondences between ground and aerial views. Recent Geometric Foundation Models (GFMs) have demonstrated strong capabilities in extracting generalizable 3D geometric features from images, but their potential in cross-view geo-spatial tasks remains underexplored. In this work, we present Geo², a unified framework that leverages Geometric priors from GFMs (e.g., VGGT) to jointly perform two geo-spatial tasks: CVGL and bidirectional CVIS. Despite the 3D reconstruction ability of GFMs, directly applying them to CVGL and CVIS remains challenging due to the large viewpoint gap between ground and aerial imagery. We propose GeoMap, which embeds ground and aerial features into a shared 3D-aware latent space, effectively reducing cross-view discrepancies for localization. This shared latent space naturally bridges cross-view image synthesis in both directions. To exploit this, we propose GeoFlow, a flow-matching model conditioned on geometry-aware latent embeddings. We further introduce a consistency loss to enforce latent alignment between the two synthesis directions, ensuring bidirectional coherence. Extensive experiments on standard benchmarks, including CVUSA, CVACT, and VIGOR, demonstrate that Geo² achieves state-of-the-art performance in both localization and synthesis, highlighting the effectiveness of 3D geometric priors for cross-view geo-spatial learning.
Problem

Research questions and friction points this paper is trying to address.

Cross-View Geo-Localization
Cross-View Image Synthesis
Geometric Foundation Models
Geo-spatial Learning
Viewpoint Discrepancy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Geometric Foundation Models
Cross-View Geo-Localization
Cross-View Image Synthesis
3D-aware Latent Space
Flow Matching