VidTAG: Temporally Aligned Video to GPS Geolocalization with Denoising Sequence Prediction at a Global Scale

📅 2026-04-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video geolocalization methods are limited either by coarse city-level granularity or by reliance on large global image galleries that are infeasible to compile, which hinders fine-grained and scalable trajectory localization. This work proposes VidTAG, a framework that uses denoising sequence prediction and temporal alignment to reconstruct trajectories without requiring an extensive image gallery. VidTAG employs a dual-encoder architecture that performs frame-to-GPS retrieval using both self-supervised and language-aligned features, complemented by a TempGeo module that aligns frame embeddings and a GeoRefiner module that refines GPS features using the aligned frame embeddings. Evaluated on the Mapillary (MSLS) and GAMa datasets, the method achieves a 20% improvement over GeoCLIP at the 1 km threshold, and it surpasses the current state of the art by 25% on the CityGuessr68k global coarse-grained video geolocalization benchmark.
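
The "1 km threshold" above refers to a distance-thresholded accuracy. As a rough illustration only, the sketch below computes the fraction of predicted coordinates that fall within a given great-circle distance of the ground truth. The function names, the Earth-radius constant, and the use of the haversine formula are assumptions made for this example; the paper's exact evaluation protocol is not described in this summary.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant for this sketch)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def accuracy_at_threshold(preds, truths, threshold_km=1.0):
    """Fraction of predictions within threshold_km of the matching ground truth.

    preds and truths are equal-length sequences of (lat, lon) pairs in degrees.
    """
    hits = sum(
        haversine_km(p[0], p[1], t[0], t[1]) <= threshold_km
        for p, t in zip(preds, truths)
    )
    return hits / len(preds)


# Toy usage: the first prediction is about 0.11 km off, the second about 111 km off.
preds = [(0.0, 0.0), (10.0, 10.0)]
truths = [(0.0, 0.001), (11.0, 10.0)]
score = accuracy_at_threshold(preds, truths, threshold_km=1.0)
```

With per-frame predictions, the same function can be applied to every frame of a trajectory, or to one representative coordinate per video, depending on how the benchmark defines a sample.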

📝 Abstract

The task of video geolocalization aims to determine the precise GPS coordinates of a video's origin and map its trajectory, with applications in forensics, social media, and exploration. Existing classification-based approaches operate at a coarse city-level granularity and fail to capture fine-grained details, while image retrieval methods are impractical on a global scale because they need extensive image galleries that are infeasible to compile. By comparison, constructing a gallery of GPS coordinates is straightforward and inexpensive. We propose VidTAG, a dual-encoder framework that performs frame-to-GPS retrieval using both self-supervised and language-aligned features. To address temporal inconsistencies in video predictions, we introduce the TempGeo module, which aligns frame embeddings, and the GeoRefiner module, an encoder-decoder architecture that refines GPS features using the aligned frame embeddings. Evaluations on the Mapillary (MSLS) and GAMa datasets demonstrate our model's ability to generate temporally consistent trajectories and outperform baselines, achieving a 20% improvement at the 1 km threshold over GeoCLIP. We also beat the current state of the art by 25% on global coarse-grained video geolocalization (CityGuessr68k). Our approach enables fine-grained video geolocalization and lays a strong foundation for future research. More details are available on the project webpage: https://parthpk.github.io/vidtag_webpage/
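
The frame-to-GPS retrieval idea in the abstract can be pictured with a minimal, dependency-free sketch: each frame embedding is compared against a gallery of GPS embeddings, and the coordinate of the most similar gallery entry is returned. Everything here is an assumption for illustration: the names `cosine` and `retrieve_gps`, the use of cosine similarity, and the toy vectors standing in for encoder outputs. It does not reproduce VidTAG's encoders, the TempGeo module, or the GeoRefiner module.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_gps(frame_embs, gallery_embs, gallery_coords):
    """For each frame embedding, return the (lat, lon) of the most similar gallery entry.

    gallery_embs[i] is the (hypothetical) embedding of gallery_coords[i].
    """
    track = []
    for frame in frame_embs:
        best = max(range(len(gallery_embs)), key=lambda i: cosine(frame, gallery_embs[i]))
        track.append(gallery_coords[best])
    return track


# Toy usage with hand-made vectors in place of real encoder outputs.
gallery_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
gallery_coords = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
frame_embs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]
track = retrieve_gps(frame_embs, gallery_embs, gallery_coords)
```

Because each frame is matched independently in this sketch, consecutive predictions can jump between distant gallery points; that is the kind of temporal inconsistency the abstract says the TempGeo and GeoRefiner modules are meant to address.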

Problem

Research questions and friction points this paper is trying to address.

video geolocalization

GPS trajectory

fine-grained localization

global scale

temporal consistency

Innovation

Methods, ideas, or system contributions that make the work stand out.

video geolocalization

temporal alignment

dual-encoder framework