🤖 AI Summary
Absolute visual localization of UAVs in GNSS-denied environments remains challenging due to poor matching robustness caused by cross-source (satellite-to-aerial) and cross-temporal image discrepancies.
Method: We propose a hierarchical cross-source image matching framework: first, semantic-guided coarse matching at the region level leverages vision foundation model features and geometric structure constraints; second, a lightweight fine-grained module aligns features for accurate pixel-level matching. Both stages operate without prior relative pose estimation.
Contribution/Results: We introduce the first "semantic-aware + structure-constrained" dual-granularity matching paradigm, enabling an end-to-end, purely visual absolute localization pipeline. Evaluated on multiple public benchmarks and our newly constructed CS-UAV dataset, our method achieves significant improvements in localization accuracy and robustness, especially under challenging conditions such as illumination variation, large viewpoint shifts, and seasonal changes.
📝 Abstract
Absolute localization, which determines an agent's position with respect to a global reference, is crucial for unmanned aerial vehicles (UAVs) in various applications, but it becomes challenging when global navigation satellite system (GNSS) signals are unavailable. Vision-based absolute localization methods, which locate the current view of the UAV in a reference satellite map to estimate its position, have become popular in GNSS-denied scenarios. However, existing methods mostly rely on traditional, low-level image matching and struggle with the significant appearance differences introduced by cross-source discrepancies and temporal variations. To overcome these limitations, in this paper, we introduce a hierarchical cross-source image matching method designed for UAV absolute localization, which integrates a semantic-aware, structure-constrained coarse matching module with a lightweight fine-grained matching module. Specifically, in the coarse matching module, semantic features derived from a vision foundation model first establish region-level correspondences under semantic and structural constraints. The fine-grained matching module is then applied to extract fine features and establish pixel-level correspondences. Building on this, we construct a UAV absolute visual localization pipeline that requires no relative localization techniques, chiefly by placing an image retrieval module before the proposed hierarchical matching modules. Experimental evaluations on public benchmark datasets and a newly introduced CS-UAV dataset demonstrate the superior accuracy and robustness of the proposed method under various challenging conditions, confirming its effectiveness.
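The retrieval-then-coarse-then-fine structure of the pipeline can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: generic feature arrays stand in for the vision-foundation-model descriptors, mutual nearest-neighbour patch matching stands in for the semantic- and structure-constrained coarse stage, and a mean-offset estimate stands in for the learned lightweight fine matcher.

```python
import numpy as np

def retrieve_tile(query_desc, tile_descs):
    """Image retrieval: pick the satellite tile whose global descriptor
    has the highest cosine similarity to the UAV query descriptor."""
    q = query_desc / np.linalg.norm(query_desc)
    t = tile_descs / np.linalg.norm(tile_descs, axis=1, keepdims=True)
    return int(np.argmax(t @ q))

def coarse_match(query_patches, tile_patches):
    """Region-level coarse matching: mutual nearest neighbours between
    patch-level features (a simple stand-in for the paper's semantic-
    and structure-constrained matching)."""
    sim = query_patches @ tile_patches.T
    fwd = sim.argmax(axis=1)  # best tile patch for each query patch
    bwd = sim.argmax(axis=0)  # best query patch for each tile patch
    return [(i, j) for i, j in enumerate(fwd) if bwd[j] == i]

def fine_match(coarse_pairs, q_coords, t_coords):
    """Pixel-level refinement sketch: average offset between matched
    region centres (the paper instead uses a lightweight fine matcher)."""
    offsets = [t_coords[j] - q_coords[i] for i, j in coarse_pairs]
    return np.mean(offsets, axis=0)
```

In this toy form, `retrieve_tile` narrows the search to one reference tile, `coarse_match` restricts correspondences to mutually consistent regions, and `fine_match` turns those regions into a pixel-level position estimate; no relative pose prior enters at any stage, mirroring the pipeline's design.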