DiffusionUavLoc: Visually Prompted Diffusion for Cross-View UAV Localization

📅 2025-11-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
In GNSS-denied environments, UAV localization suffers from significant geometric and appearance domain shifts between nadir satellite imagery and oblique aerial views; existing methods rely on complex architectures, textual prompts, or extensive annotations, limiting generalizability. This paper proposes a vision-prompted cross-view localization framework: the first image-conditioned, text-free diffusion model, integrated with a VAE for unified representation learning and a training-free geometric renderer that synthesizes pseudo-satellite images from UAV imagery. Cross-view matching is performed efficiently via fixed-timestep feature extraction and cosine similarity. The method requires no textual prompts or heavy annotation and avoids intricate network designs. Evaluated on University-1652 and SUES-200, it performs competitively, and is particularly strong in satellite-to-UAV retrieval, demonstrating effectiveness and generalization.

📝 Abstract
With the rapid growth of the low-altitude economy, unmanned aerial vehicles (UAVs) have become key platforms for measurement and tracking in intelligent patrol systems. However, in GNSS-denied environments, localization schemes that rely solely on satellite signals are prone to failure. Cross-view image retrieval-based localization is a promising alternative, yet substantial geometric and appearance domain gaps exist between oblique UAV views and nadir satellite orthophotos. Moreover, conventional approaches often depend on complex network architectures, text prompts, or large amounts of annotation, which hinders generalization. To address these issues, we propose DiffusionUavLoc, a cross-view localization framework that is image-prompted, text-free, diffusion-centric, and employs a VAE for unified representation. We first use training-free geometric rendering to synthesize pseudo-satellite images from UAV imagery as structural prompts. We then design a text-free conditional diffusion model that fuses multimodal structural cues to learn features robust to viewpoint changes. At inference, descriptors are computed at a fixed time step t and compared using cosine similarity. On University-1652 and SUES-200, the method performs competitively for cross-view localization, especially for satellite-to-drone in University-1652. Our data and code will be published at the following URL: https://github.com/liutao23/DiffusionUavLoc.git.
Problem

Research questions and friction points this paper is trying to address.

Addresses UAV localization failure in GNSS-denied environments
Bridges geometric and appearance gaps between UAV and satellite views
Reduces dependency on text prompts and complex annotation requirements
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free geometric rendering for pseudo-satellite images
Text-free conditional diffusion model for feature learning
VAE-based unified representation with cosine similarity comparison
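The retrieval step described above (fixed-timestep descriptors compared by cosine similarity) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `query_desc` and `gallery_descs` stand in for descriptors extracted from the diffusion model at the fixed time step t, and the helper name `cosine_retrieve` is hypothetical.

```python
import numpy as np

def cosine_retrieve(query_desc, gallery_descs):
    """Rank gallery descriptors by cosine similarity to a query descriptor.

    query_desc:    (d,) descriptor of one view (e.g. a UAV image).
    gallery_descs: (n, d) descriptors of candidate views (e.g. satellite tiles).
    Returns indices sorted from best to worst match, plus the similarities.
    """
    q = query_desc / np.linalg.norm(query_desc)
    g = gallery_descs / np.linalg.norm(gallery_descs, axis=1, keepdims=True)
    sims = g @ q                      # cosine similarity of each candidate
    return np.argsort(-sims), sims    # descending order

# Toy example with 2-D descriptors: the query aligns with gallery entry 1.
gallery = np.array([[1.0, 0.0],
                    [0.6, 0.8],
                    [0.0, 1.0]])
query = np.array([0.6, 0.8])
ranking, sims = cosine_retrieve(query, gallery)
# ranking[0] == 1  (the identical descriptor ranks first)
```

Because both sides are L2-normalized, the dot product equals the cosine similarity, so retrieval reduces to a single matrix-vector product over the gallery.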
Tao Liu
School of Electronic and Optical Engineering, Nanjing University of Science and Technology, Nanjing, 210094, Jiangsu, China
Kan Ren
Assistant Professor, ShanghaiTech University
Machine Learning · Data Mining · Large Language Model · Foundation Model
Qian Chen
School of Electronic and Optical Engineering, Nanjing University of Science and Technology, Nanjing, 210094, Jiangsu, China