AI Summary
This work addresses the performance degradation in cross-view geo-localization caused by viewpoint discrepancies, altitude variations, and weather disturbances. To tackle these challenges, the authors propose SkyPart, a method that introduces a plug-in head atop a vision transformer to discover and group image patches into semantic parts via learnable prototype-based competitive assignment, thereby disentangling layout from texture. Height-conditional modulation is further incorporated to mitigate the influence of altitude information on feature embeddings. The model's robustness and generalization are enhanced through a graph attention readout mechanism and a Kendall uncertainty-weighted multi-task loss. Experiments demonstrate that SkyPart achieves new state-of-the-art results on the SUES-200, University-1652, and DenseUAV benchmarks, significantly outperforming existing approaches under ten diverse weather perturbations in WeatherPrompt, while maintaining a compact model size of only 26.95 million parameters.
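The prototype-based competitive assignment mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`cosine_assign`, `pool_parts`), the softmax over prototypes, and the `temperature` value are assumptions chosen for illustration. Each patch token is compared to every prototype by cosine similarity, the prototypes compete for the token through a softmax, and each part descriptor is the assignment-weighted mean of the patch tokens.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cosine_assign(patches, prototypes, temperature=0.1):
    """Soft-assign each patch to prototypes in a single pass.

    Hypothetical helper: the softmax over prototypes and the temperature
    are illustrative assumptions, not details taken from the paper.
    Returns one row per patch; each row sums to 1 across prototypes.
    """
    protos = [l2_normalize(c) for c in prototypes]
    assignments = []
    for p in patches:
        p_hat = l2_normalize(p)
        # Cosine similarity to every prototype, sharpened by the temperature.
        sims = [sum(a * b for a, b in zip(p_hat, c)) / temperature for c in protos]
        # Numerically stable softmax: prototypes compete for this patch.
        m = max(sims)
        exps = [math.exp(s - m) for s in sims]
        z = sum(exps)
        assignments.append([e / z for e in exps])
    return assignments


def pool_parts(patches, assignments, eps=1e-8):
    """Build one descriptor per prototype as the assignment-weighted mean of patches."""
    num_parts = len(assignments[0])
    dim = len(patches[0])
    parts = []
    for j in range(num_parts):
        mass = sum(row[j] for row in assignments) + eps
        parts.append([
            sum(row[j] * p[i] for row, p in zip(assignments, patches)) / mass
            for i in range(dim)
        ])
    return parts


# Toy example: three 2-D "patch tokens" and two prototypes.
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
prototypes = [[1.0, 0.0], [0.0, 1.0]]
assignments = cosine_assign(patches, prototypes)
parts = pool_parts(patches, assignments)
```

In this toy input the first two patches are claimed almost entirely by the first prototype and the third patch by the second, so each part descriptor summarizes one group of patches.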
Abstract
Cross-view geo-localization (CVGL), which matches an oblique drone view to a geo-referenced satellite tile, has emerged as a key alternative for autonomous drone navigation when GNSS signals are jammed, spoofed, or unavailable. Despite strong recent progress, three limitations persist: (1) global-descriptor designs compress the patch grid into a single vector without separating layout from texture across the view gap; (2) altitude-related scale variation is retained in the learned embedding rather than marginalized; and (3) multi-objective training relies on hand-tuned scalars over losses on incompatible gradient scales. We propose SkyPart, a lightweight swappable head for patch-based vision transformers (ViTs) that institutes explicit part grouping over the patch grid. SkyPart has four theory-grounded components: (i) learnable prototypes competing for patch tokens via single-pass cosine assignment; (ii) altitude-conditioned linear modulation applied only during training, making the retrieval embedding altitude-free at inference; (iii) a graph-attention readout over active prototypes; and (iv) a Kendall uncertainty-weighted multi-objective loss whose stationary points are Pareto-stationary. At 26.95M parameters and 22.14 GFLOPs, SkyPart is the smallest among top-performing methods and sets a new state of the art on SUES-200, University-1652, and DenseUAV under a single-pass, no-re-ranking, no-TTA protocol. Its advantage over the strongest baseline widens under the ten-condition WeatherPrompt corruption benchmark.
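To make component (iv) concrete, the sketch below shows the commonly used form of Kendall-style homoscedastic uncertainty weighting, in which each task loss L_i is scaled by a learned log-variance s_i: total = sum_i (0.5 * exp(-s_i) * L_i + 0.5 * s_i). Whether SkyPart uses exactly this parameterization is an assumption; the function names `kendall_weighted_loss` and `optimal_log_var` are hypothetical, and in real training the s_i would be learnable parameters updated by the optimizer rather than set in closed form.

```python
import math


def kendall_weighted_loss(losses, log_vars):
    """Combine task losses with uncertainty weights.

    Assumed form: sum_i 0.5 * exp(-s_i) * L_i + 0.5 * s_i,
    where s_i = log(sigma_i^2) is a per-task log-variance.
    """
    if len(losses) != len(log_vars):
        raise ValueError("need one log-variance per task loss")
    return sum(
        0.5 * math.exp(-s) * loss + 0.5 * s
        for loss, s in zip(losses, log_vars)
    )


def optimal_log_var(loss):
    """Log-variance that minimizes the weighted term for a fixed positive loss.

    Setting d/ds [0.5 * exp(-s) * L + 0.5 * s] = 0 gives s = log(L).
    """
    return math.log(loss)


# Two task losses on different scales, with equal (zero) log-variances.
equal_weighting = kendall_weighted_loss([2.0, 4.0], [0.0, 0.0])

# Re-weighting the larger loss with its stationary log-variance lowers the total.
s_star = optimal_log_var(4.0)
adapted_weighting = kendall_weighted_loss([2.0, 4.0], [0.0, s_star])
```

The `0.5 * s_i` term keeps the weights from collapsing to zero: increasing s_i shrinks the weight on L_i but pays a penalty, which replaces the hand-tuned scalars the abstract lists as limitation (3).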