Spatially-Weighted CLIP for Street-View Geo-localization

📅 2026-04-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Conventional CLIP-based street-view geo-localization treats every non-matching sample as an equally weighted negative. This ignores spatial autocorrelation and contributes to lower localization accuracy and long-tail errors. The paper proposes Spatially-Weighted CLIP (SW-CLIP), which builds Tobler's First Law of Geography into multimodal contrastive learning. SW-CLIP encodes geographic positions as text, replaces one-hot InfoNCE targets with spatially weighted soft labels derived from geodesic distance, and adds a neighborhood-consistency regularizer to preserve local geographic structure in the embedding space. On a multi-city dataset, the authors report that SW-CLIP improves geo-localization accuracy, reduces long-tail errors, and enhances spatial coherence compared to standard CLIP, which they take as evidence that geographic alignment matters more than purely semantic alignment for this task.

📝 Abstract

This paper proposes Spatially-Weighted CLIP (SW-CLIP), a novel framework for street-view geo-localization that explicitly incorporates spatial autocorrelation into vision-language contrastive learning. Unlike conventional CLIP-based methods that treat all non-matching samples as equally negative, SW-CLIP leverages Tobler's First Law of Geography to model geographic relationships through distance-aware soft supervision. Specifically, we introduce a location-as-text representation to encode geographic positions and replace one-hot InfoNCE targets with spatially weighted soft labels derived from geodesic distance. Additionally, a neighborhood-consistency regularization is employed to preserve local spatial structure in the embedding space. Experiments on a multi-city dataset demonstrate that SW-CLIP significantly improves geo-localization accuracy, reduces long-tail errors, and enhances spatial coherence compared to standard CLIP. The results highlight the importance of shifting from semantic alignment to geographic alignment for robust geo-localization and provide a general paradigm for integrating spatial principles into multimodal representation learning.
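The abstract does not give the exact form of the spatially weighted soft labels, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a Gaussian kernel over haversine distance with a hypothetical `bandwidth_km` parameter, and a symmetric soft-target cross-entropy in place of one-hot InfoNCE. All function names, the kernel, and the sample coordinates are illustrative assumptions. The location-as-text encoder and the neighborhood-consistency regularizer are not modeled here.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, used by the haversine formula


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def spatial_soft_labels(coords, bandwidth_km=1.0):
    """Row-normalized Gaussian weights over pairwise distance.

    The Gaussian kernel and bandwidth are assumptions for illustration;
    nearer locations receive more target mass than distant ones.
    """
    rows = []
    for p in coords:
        w = [math.exp(-(haversine_km(p, q) / bandwidth_km) ** 2) for q in coords]
        total = sum(w)
        rows.append([x / total for x in w])
    return rows


def log_softmax(row):
    """Numerically stable log-softmax of one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def soft_cross_entropy(logits, targets):
    """Mean over rows of -sum_j targets[i][j] * log_softmax(logits[i])[j]."""
    total = 0.0
    for row, t in zip(logits, targets):
        total -= sum(tj * lj for tj, lj in zip(t, log_softmax(row)))
    return total / len(logits)


def soft_info_nce(logits, targets):
    """Symmetric loss: image-to-text on rows, text-to-image on columns.

    The same target matrix is reused in both directions because the
    underlying distance matrix is symmetric.
    """
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (soft_cross_entropy(logits, targets)
                  + soft_cross_entropy(logits_t, targets))


# Hypothetical batch: two nearby points in Paris and one in London.
coords = [(48.8584, 2.2945), (48.8606, 2.3376), (51.5007, -0.1246)]
targets = spatial_soft_labels(coords, bandwidth_km=5.0)
# Stand-in for temperature-scaled image-text similarity logits.
logits = [[5.0, 2.0, -1.0], [2.0, 5.0, -1.0], [-1.0, -1.0, 5.0]]
loss = soft_info_nce(logits, targets)
```

As the bandwidth shrinks toward zero, the targets approach one-hot labels and the loss reduces to standard InfoNCE, which makes the bandwidth a direct control on how much geographic smoothing is applied.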

Problem

Research questions and friction points this paper is trying to address.

geo-localization

spatial autocorrelation

CLIP

vision-language learning

geographic alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

spatial autocorrelation

vision-language contrastive learning

soft label supervision