Spatially-Weighted CLIP for Street-View Geo-localization

📅 2026-04-05
🤖 AI Summary
Conventional CLIP-based street-view geo-localization treats every non-matching sample as an equally weighted negative. This ignores spatial autocorrelation and leads to suboptimal localization accuracy and long-tail errors. To address this, the paper proposes Spatially-Weighted CLIP (SW-CLIP), described here as the first approach to integrate Tobler's First Law of Geography into multimodal contrastive learning. SW-CLIP replaces one-hot contrastive targets with spatially weighted soft labels derived from geodesic distance and adds a neighborhood-consistency regularizer to preserve local geographic structure. Built on the CLIP architecture, with locations encoded as text and a modified InfoNCE loss, SW-CLIP is reported to improve geo-localization accuracy on a multi-city dataset and to reduce long-tail errors relative to standard CLIP, supporting the claim that geography-aware alignment outperforms purely semantic alignment.
📝 Abstract
This paper proposes Spatially-Weighted CLIP (SW-CLIP), a novel framework for street-view geo-localization that explicitly incorporates spatial autocorrelation into vision-language contrastive learning. Unlike conventional CLIP-based methods that treat all non-matching samples as equally negative, SW-CLIP leverages Tobler's First Law of Geography to model geographic relationships through distance-aware soft supervision. Specifically, we introduce a location-as-text representation to encode geographic positions and replace one-hot InfoNCE targets with spatially weighted soft labels derived from geodesic distance. Additionally, a neighborhood-consistency regularization is employed to preserve local spatial structure in the embedding space. Experiments on a multi-city dataset demonstrate that SW-CLIP significantly improves geo-localization accuracy, reduces long-tail errors, and enhances spatial coherence compared to standard CLIP. The results highlight the importance of shifting from semantic alignment to geographic alignment for robust geo-localization and provide a general paradigm for integrating spatial principles into multimodal representation learning.
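The soft-label idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the Gaussian distance kernel, the `bandwidth_km` and `temperature` values, and all function names are assumptions chosen for illustration, since the abstract only states that the targets are derived from geodesic distance. The haversine formula stands in for the geodesic distance.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat, lon):
    """Pairwise great-circle distances (km) between N points given in degrees."""
    lat = np.radians(np.asarray(lat, dtype=float))
    lon = np.radians(np.asarray(lon, dtype=float))
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


def spatial_soft_labels(dist_km, bandwidth_km=5.0):
    """Row-normalized distance weights used as soft targets.

    Assumption: a Gaussian kernel; the paper may use a different decay.
    Nearby samples get weight close to the positive, distant ones near zero.
    """
    w = np.exp(-(dist_km / bandwidth_km) ** 2)
    return w / w.sum(axis=1, keepdims=True)


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def soft_infonce(img_emb, txt_emb, targets, temperature=0.07):
    """Symmetric contrastive loss with soft targets instead of one-hot labels."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # rows: images, columns: location texts
    loss_i2t = -(targets * log_softmax(logits, axis=1)).sum(axis=1).mean()
    loss_t2i = -(targets * log_softmax(logits.T, axis=1)).sum(axis=1).mean()
    return 0.5 * (loss_i2t + loss_t2i)


# Toy batch: two nearby points in Glasgow and one in Guangzhou (illustrative).
lat = [55.8642, 55.8724, 23.1291]
lon = [-4.2518, -4.2900, 113.2644]
targets = spatial_soft_labels(haversine_km(lat, lon), bandwidth_km=5.0)

rng = np.random.default_rng(0)
img_emb = rng.normal(size=(3, 8))  # stand-ins for CLIP image embeddings
txt_emb = rng.normal(size=(3, 8))  # stand-ins for location-text embeddings
print(soft_infonce(img_emb, txt_emb, targets))
```

As the bandwidth shrinks toward zero, the targets approach the identity matrix and the loss reduces to the standard one-hot InfoNCE objective, so under these assumptions the conventional CLIP loss is a limiting case of the spatially weighted one.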
Problem

Research questions and friction points this paper is trying to address.

geo-localization
spatial autocorrelation
CLIP
vision-language learning
geographic alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

spatial autocorrelation
vision-language contrastive learning
soft label supervision
geodesic distance
neighborhood consistency
Authors

Ting Han
Sun Yat-sen University
point cloud, remote sensing

Fengjiao Li
School of Geographical and Earth Sciences, University of Glasgow, Glasgow, United Kingdom

Chunsong Chen
School of Geospatial Engineering and Science, Sun Yat-Sen University, Zhuhai, China

Haoling Huang
School of System Science and Engineering, Sun Yat-Sen University, Guangzhou, China

Yiping Chen
Sun Yat-sen University
Point Clouds, Mobile Mapping, Geomatics, LiDAR, 3D Vision

Meiliu Wu
University of Glasgow
Geospatial Data Science, GeoAI, Urban Analytics, Environmental Sustainability