🤖 AI Summary
Existing video geolocation methods are limited by coarse city-level granularity or reliance on large-scale global image databases that are difficult to construct, hindering fine-grained and scalable trajectory localization. This work proposes VidTAG, a novel framework that introduces denoising sequence prediction and temporal alignment mechanisms to enable high-precision trajectory reconstruction without requiring extensive image libraries. VidTAG employs a dual-encoder architecture that integrates self-supervised features with language–vision alignment signals, complemented by a TempGeo module for frame-level temporal alignment and a GeoRefiner module to refine GPS sequences. Evaluated on the Mapillary and GAMa datasets, the method outperforms GeoCLIP by 20% under a 1-kilometer accuracy threshold and surpasses state-of-the-art approaches by 25% on the CityGuessr68k global coarse-grained geolocation benchmark.
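The "1-kilometer accuracy threshold" can be read as the fraction of frames whose predicted GPS point lies within 1 km of the ground truth. The sketch below illustrates that reading using the haversine great-circle distance. The function names, the Earth-radius constant, and the per-frame averaging are assumptions made for illustration; the paper's exact evaluation protocol is not given here.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant for this sketch)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def accuracy_at_threshold(predicted, actual, threshold_km=1.0):
    """Fraction of frames predicted within threshold_km of their ground-truth location."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        return 0.0
    hits = sum(
        haversine_km(*p, *a) <= threshold_km for p, a in zip(predicted, actual)
    )
    return hits / len(predicted)
```

For example, with two frames where one prediction is about 0.1 km off and the other about 111 km off, `accuracy_at_threshold` returns 0.5 at the default 1 km threshold.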
📝 Abstract
Video geolocalization aims to determine the precise GPS coordinates of a video's origin and map its trajectory, with applications in forensics, social media, and exploration. Existing classification-based approaches operate at a coarse city-level granularity and fail to capture fine-grained details, while image retrieval methods are impractical on a global scale because they require extensive image galleries that are infeasible to compile. In contrast, constructing a gallery of GPS coordinates is straightforward and inexpensive. We propose VidTAG, a dual-encoder framework that performs frame-to-GPS retrieval using both self-supervised and language-aligned features. To address temporal inconsistencies in video predictions, we introduce the TempGeo module, which aligns frame embeddings, and the GeoRefiner module, an encoder-decoder architecture that refines GPS features using the aligned frame embeddings. Evaluations on the Mapillary (MSLS) and GAMa datasets demonstrate our model's ability to generate temporally consistent trajectories and outperform baselines, achieving a 20% improvement at the 1 km threshold over GeoCLIP. We also beat the current state of the art by 25% on global coarse-grained video geolocalization (CityGuessr68k). Our approach enables fine-grained video geolocalization and lays a strong foundation for future research. More details are available on the project webpage: https://parthpk.github.io/vidtag_webpage/
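To make the idea of frame-to-GPS retrieval concrete, here is a minimal sketch of the retrieval step alone: each frame embedding is matched to the most similar entry in a gallery of GPS embeddings, and that entry's coordinates become the frame's prediction. It is not VidTAG's implementation. The embeddings are toy vectors standing in for encoder outputs, the cosine-similarity scoring and the names `cosine` and `retrieve_gps` are assumptions, and the TempGeo and GeoRefiner stages are omitted entirely.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_gps(frame_embeddings, gallery_embeddings, gallery_coords):
    """Return, for each frame, the gallery coordinate whose embedding is most similar.

    frame_embeddings:   one vector per video frame (hypothetical encoder output)
    gallery_embeddings: one vector per gallery GPS point (hypothetical encoder output)
    gallery_coords:     (lat, lon) pairs aligned with gallery_embeddings
    """
    if len(gallery_embeddings) != len(gallery_coords):
        raise ValueError("gallery embeddings and coordinates must align")
    trajectory = []
    for frame in frame_embeddings:
        best = max(
            range(len(gallery_coords)),
            key=lambda i: cosine(frame, gallery_embeddings[i]),
        )
        trajectory.append(gallery_coords[best])
    return trajectory


# Toy usage: two gallery points and two frames, each frame closest to one point.
gallery_coords = [(40.0, -74.0), (48.85, 2.35)]
gallery_embeddings = [[1.0, 0.0], [0.0, 1.0]]
frames = [[0.9, 0.1], [0.2, 0.8]]
trajectory = retrieve_gps(frames, gallery_embeddings, gallery_coords)
```

Because every frame is matched independently in this sketch, consecutive predictions can jump between distant gallery points, which is the kind of temporal inconsistency the abstract says the TempGeo and GeoRefiner modules are meant to address.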