🤖 AI Summary
To address the challenge of matching ground-level images that lack GPS labels against satellite imagery, particularly under varying field-of-view (FoV) conditions where localization accuracy degrades, this paper proposes SAN-QUAD (Quadruple Semantic Align Net), a four-stream Siamese-like network. SAN-QUAD extends prior state-of-the-art approaches by applying semantic segmentation to both ground and satellite views, enabling fine-grained semantic alignment rather than reliance solely on appearance similarity or geometric cues. By jointly modeling scene semantics and view-invariant appearance features, it improves the robustness of cross-view matching. Experiments on a subset of the CVUSA dataset show gains of up to 9.8% over prior methods across various FoV settings. Because the method geolocates ground-view images without any GPS metadata, it offers a practical tool for verifying the origin of imagery in misinformation-sensitive domains such as journalism and forensic analysis.
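As a rough illustration of the four-stream design, the sketch below shows one plausible PyTorch layout: two streams encode the ground RGB image and its segmentation map, two encode the satellite tile and its segmentation map, and the per-view features are fused and projected into a shared embedding space. Everything here (the tiny `small_cnn` backbone, concatenation fusion, the 128-dimensional embedding, and the names `QuadStreamNet`, `embed_ground`, `embed_satellite`) is an assumption for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def small_cnn(in_ch: int) -> nn.Sequential:
    """Tiny stand-in encoder; the real backbone is likely a much deeper CNN."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),  # -> (B, 64)
    )


class QuadStreamNet(nn.Module):
    """Four streams: ground RGB, ground segmentation, satellite RGB,
    satellite segmentation. The two streams of each view are fused by
    concatenation and projected into a shared embedding space."""

    def __init__(self, seg_ch: int = 1, dim: int = 128):
        super().__init__()
        self.g_rgb, self.g_seg = small_cnn(3), small_cnn(seg_ch)
        self.s_rgb, self.s_seg = small_cnn(3), small_cnn(seg_ch)
        self.g_head = nn.Linear(64 + 64, dim)  # fuse RGB + semantic features
        self.s_head = nn.Linear(64 + 64, dim)

    def embed_ground(self, img: torch.Tensor, seg: torch.Tensor) -> torch.Tensor:
        f = torch.cat([self.g_rgb(img), self.g_seg(seg)], dim=1)
        return F.normalize(self.g_head(f), dim=1)  # unit-norm embedding

    def embed_satellite(self, img: torch.Tensor, seg: torch.Tensor) -> torch.Tensor:
        f = torch.cat([self.s_rgb(img), self.s_seg(seg)], dim=1)
        return F.normalize(self.s_head(f), dim=1)

    def forward(self, g_img, g_seg, s_img, s_seg):
        return self.embed_ground(g_img, g_seg), self.embed_satellite(s_img, s_seg)
```

A triplet or contrastive loss over matched ground-satellite pairs would be a typical training objective for such an embedding, though the summary does not specify the loss actually used.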
📝 Abstract
Recent advances in generative AI have significantly increased the online dissemination of altered images and videos, raising serious concerns about the credibility of digital media distributed through information channels and social networks. This issue particularly affects domains that rely on trustworthy data, such as journalism, forensic analysis, and Earth observation. To address these concerns, the ability to geolocate a non-geo-tagged ground-view image without external information such as GPS coordinates has become increasingly important. This study tackles the challenge of linking a ground-view image, potentially captured with varying fields of view (FoV), to its corresponding satellite image without the aid of GPS data. To this end, we propose a novel four-stream Siamese-like architecture, the Quadruple Semantic Align Net (SAN-QUAD), which extends previous state-of-the-art (SOTA) approaches by leveraging semantic segmentation applied to both ground and satellite imagery. Experimental results on a subset of the CVUSA dataset demonstrate improvements of up to 9.8% over prior methods across various FoV settings.
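To make the GPS-free retrieval setting concrete, the hypothetical snippet below builds on the `QuadStreamNet` sketch above: one limited-FoV ground query and its segmentation map are embedded, a small gallery of satellite tiles is embedded likewise, and candidates are ranked by cosine similarity. All tensor shapes, the gallery size, and the ranking procedure are illustrative assumptions, not details taken from the paper.

```python
import torch

net = QuadStreamNet().eval()
g_img = torch.randn(1, 3, 128, 360)   # one ground query (e.g. a limited-FoV crop)
g_seg = torch.randn(1, 1, 128, 360)   # its semantic segmentation map
s_img = torch.randn(8, 3, 256, 256)   # gallery of 8 candidate satellite tiles
s_seg = torch.randn(8, 1, 256, 256)   # their segmentation maps

with torch.no_grad():
    q = net.embed_ground(g_img, g_seg)           # (1, dim)
    gallery = net.embed_satellite(s_img, s_seg)  # (8, dim)

scores = (q @ gallery.T).squeeze(0)  # cosine similarity (embeddings are unit-norm)
best = scores.argmax().item()        # index of the top-ranked satellite tile
print(f"best-matching tile: {best}")
```

In a recall@k evaluation of the kind typically reported on CVUSA, localization succeeds when the satellite tile covering the query's true location appears among the top-k ranked candidates.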