🤖 AI Summary
Existing cross-view geo-localization methods for UAVs in GNSS-denied environments depend on large-scale paired UAV-satellite image datasets, which are costly to acquire and limit generalization. This paper proposes a self-supervised cross-view localization framework trained on satellite imagery only. The method models the visual domain shift between satellite and UAV viewpoints without paired samples, which enables contrastive self-supervised learning. The authors introduce CAEVL, a lightweight contrastive autoencoder architecture, together with viewpoint-aware data augmentation tailored to the geometric and appearance discrepancies between the two views. Evaluated on ViLD, a newly released dataset of real-world UAV images, the approach achieves localization performance competitive with baselines trained on paired data, despite seeing no UAV imagery during training, and shows strong generalization. The work offers an efficient, scalable paradigm for visual localization without GNSS.
📝 Abstract
Image-based localization in GNSS-denied environments is critical for UAV autonomy. Existing state-of-the-art approaches rely on matching UAV images to geo-referenced satellite images; however, they typically require large-scale, paired UAV-satellite datasets for training. Such data are costly to acquire and often unavailable, limiting their applicability. To address this challenge, we adopt a training paradigm that removes the need for UAV imagery during training by learning directly from satellite-view reference images. This is achieved through a dedicated augmentation strategy that simulates the visual domain shift between satellite and real-world UAV views. We introduce CAEVL, an efficient model designed to exploit this paradigm, and validate it on ViLD, a new and challenging dataset of real-world UAV images that we release to the community. Our method achieves competitive performance compared to approaches trained with paired data, demonstrating its effectiveness and strong generalization capabilities.
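The satellite-only training idea can be illustrated with a minimal sketch. The abstract does not specify the augmentations or the loss, so everything below is an assumption for illustration: the functions `augment` and `info_nce`, the choice of rotation, crop, and brightness perturbations, and the use of an InfoNCE-style contrastive loss (suggested only by the summary's description of CAEVL as contrastive). It is not the authors' implementation.

```python
import math
import random

def augment(tile, rng):
    """Hypothetical view-shift augmentation for a square grayscale satellite tile.

    `tile` is a list of rows with values in [0, 1]. The three perturbations
    are illustrative stand-ins, not the augmentations used by CAEVL.
    """
    # Heading change: rotate clockwise by a random multiple of 90 degrees.
    for _ in range(rng.randrange(4)):
        tile = [list(row) for row in zip(*tile[::-1])]
    # Footprint change: random crop to 3/4 of the side length.
    n = len(tile)
    c = max(1, (3 * n) // 4)
    top, left = rng.randrange(n - c + 1), rng.randrange(n - c + 1)
    tile = [row[left:left + c] for row in tile[top:top + c]]
    # Appearance change: global brightness gain, clipped to [0, 1].
    gain = rng.uniform(0.7, 1.3)
    return [[min(1.0, max(0.0, gain * v)) for v in row] for row in tile]

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch of L2-normalised embeddings.

    anchors[i] and positives[i] are assumed to come from two augmented
    views of the same satellite tile; other indices act as negatives.
    """
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [sum(x * y for x, y in zip(a, p)) / temperature
                  for p in positives]
        m = max(logits)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum - logits[i]
    return total / len(anchors)
```

In a training loop of this kind, each satellite tile would be augmented twice, both views passed through the encoder, and the loss minimised so that views of the same location embed close together. At inference, a real UAV image would be embedded and matched against embeddings of geo-referenced satellite tiles.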