DiffVL: Diffusion-Based Visual Localization on 2D Maps via BEV-Conditioned GPS Denoising

📅 2025-09-17
🤖 AI Summary
To address the high cost of high-definition (HD) maps and the limited alignment accuracy between bird's-eye-view (BEV) visual features and standard-definition (SD) maps such as OpenStreetMap, a problem exacerbated by urban GPS multipath noise, this work integrates diffusion models into visual localization. We propose a generative framework that treats noisy GPS trajectories as a prior and jointly models BEV visual features, SD vector maps, and GPS observations, iteratively denoising the GPS input to estimate the posterior distribution of the true ego-vehicle pose. This approach removes the reliance on HD maps, which makes scalable, low-cost localization feasible. Evaluated on multiple public datasets, our method achieves sub-meter absolute localization accuracy, outperforms existing BEV-based map-matching approaches, and is reported to generalize across diverse urban environments.

📝 Abstract
Accurate visual localization is crucial for autonomous driving, yet existing methods face a fundamental dilemma: while high-definition (HD) maps provide high-precision localization references, their costly construction and maintenance hinder scalability, which drives research toward standard-definition (SD) maps like OpenStreetMap. Current SD-map-based approaches primarily focus on Bird's-Eye View (BEV) matching between images and maps, overlooking a ubiquitous signal: noisy GPS. Although GPS is readily available, it suffers from multipath errors in urban environments. We propose DiffVL, the first framework to reformulate visual localization as a GPS denoising task using diffusion models. Our key insight is that a noisy GPS trajectory, when conditioned on visual BEV features and SD maps, implicitly encodes the true pose distribution, which can be recovered through iterative diffusion refinement. Unlike prior BEV-matching methods (e.g., OrienterNet) or transformer-based registration approaches, DiffVL learns to reverse GPS noise perturbations by jointly modeling GPS, SD map, and visual signals, achieving sub-meter accuracy without relying on HD maps. Experiments on multiple datasets demonstrate that our method achieves state-of-the-art accuracy compared to BEV-matching baselines. Crucially, our work proves that diffusion models can enable scalable localization by treating noisy GPS as a generative prior, marking a paradigm shift from traditional matching-based methods.
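
The abstract describes learning to reverse GPS noise perturbations. As a rough illustration of the perturbation side, the sketch below applies the standard DDPM forward process to a short trajectory of 2D positions. Everything in it is an assumption rather than a detail from the paper: the linear beta schedule, the Gaussian noise model (real multipath error is not Gaussian), the normalized map coordinates, and the function names are all hypothetical.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (a common DDPM default, assumed here)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar[t] is the running product of (1 - beta_s) for s <= t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noise_trajectory(trajectory, t, alpha_bars, rng):
    """Sample x_t ~ q(x_t | x_0) for a list of (x, y) positions.

    Returns the noised trajectory and the Gaussian noise that was added;
    the noise is the regression target in epsilon-prediction training.
    """
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    noised, noise = [], []
    for x, y in trajectory:
        ex, ey = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        noise.append((ex, ey))
        noised.append((signal * x + sigma * ex, signal * y + sigma * ey))
    return noised, noise


# Hypothetical clean trajectory in normalized map coordinates.
clean = [(0.10, 0.20), (0.15, 0.22), (0.20, 0.25)]
alpha_bars = cumulative_alphas(linear_beta_schedule(100))
noised, target_noise = noise_trajectory(clean, 60, alpha_bars, random.Random(0))
```

In a training loop built on this sketch, a network would receive the noised trajectory, the step index, and the conditioning signals (BEV features and SD map) and would be trained to predict `target_noise`.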

Problem

Research questions and friction points this paper is trying to address.

Denoise noisy GPS signals for visual localization
Reformulate localization as diffusion-based GPS refinement
Achieve sub-meter accuracy without HD maps

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses diffusion models for GPS denoising
Conditions on visual BEV features and maps
Learns to reverse GPS noise perturbations
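
The three points above can be sketched as a conditional reverse-diffusion loop. This is a minimal sketch, not the paper's sampler: it uses a deterministic DDIM-style update (chosen here for simplicity), treats the raw GPS fix as the noisiest sample, and takes a hypothetical `predict_noise(xy, t, condition)` callable in place of the trained network. The `condition` argument stands in for the BEV features and SD map; an oracle predictor that already knows the true position is used only to show that the loop is wired correctly.

```python
import math


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear schedule (assumed)."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def ddim_denoise(noisy_xy, condition, predict_noise, alpha_bars):
    """Iteratively refine a noisy 2D position with deterministic DDIM steps."""
    x, y = noisy_xy
    for t in range(len(alpha_bars) - 1, -1, -1):
        ab_t = alpha_bars[t]
        ab_prev = alpha_bars[t - 1] if t > 0 else 1.0
        ex, ey = predict_noise((x, y), t, condition)
        # Estimate the clean position implied by the predicted noise.
        x0 = (x - math.sqrt(1.0 - ab_t) * ex) / math.sqrt(ab_t)
        y0 = (y - math.sqrt(1.0 - ab_t) * ey) / math.sqrt(ab_t)
        # Move to the previous (less noisy) step along the same noise direction.
        x = math.sqrt(ab_prev) * x0 + math.sqrt(1.0 - ab_prev) * ex
        y = math.sqrt(ab_prev) * y0 + math.sqrt(1.0 - ab_prev) * ey
    return x, y


def oracle_noise_predictor(true_xy, alpha_bars):
    """Stand-in for a trained network: computes the exact noise from the truth."""
    def predict_noise(xy, t, condition):
        ab = alpha_bars[t]
        return tuple(
            (v - math.sqrt(ab) * v0) / math.sqrt(1.0 - ab)
            for v, v0 in zip(xy, true_xy)
        )
    return predict_noise


alpha_bars = make_alpha_bars(50)
true_pose = (0.30, -0.10)    # hypothetical ground truth, normalized units
gps_fix = (0.42, 0.05)       # hypothetical noisy GPS reading
refined = ddim_denoise(
    gps_fix, None, oracle_noise_predictor(true_pose, alpha_bars), alpha_bars
)
```

With the oracle, `refined` lands on `true_pose` up to floating-point error. A learned predictor would instead have to infer the noise from the conditioning signals, which is where the visual and map information enters.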