🤖 AI Summary
Under challenging conditions such as drastic illumination changes and image-level ambiguities, scene coordinate regression (SCR) methods are considerably less robust than feature-matching-based approaches. To close this gap, we propose a robust visual localization framework that requires neither 3D ground-truth supervision nor network ensembling. Our method introduces three key components: (1) covisibility graph-based global encoding learning and data augmentation; (2) a depth-adjusted reprojection loss that facilitates implicit triangulation; and (3) a lightweight CNN-based local feature extractor integrated into an end-to-end optimization architecture. Our approach achieves state-of-the-art results on challenging large-scale datasets. On the Aachen Day-Night dataset, it is 10 times more accurate than previous SCR methods with similar map sizes, and it requires maps at least 5 times smaller than any other SCR method while still delivering superior accuracy.
📝 Abstract
Learning-based visual localization methods that use scene coordinate regression (SCR) offer the advantage of smaller map sizes. However, on datasets with complex illumination changes or image-level ambiguities, they remain a less robust alternative to feature matching methods. This work aims to close the gap. We introduce a covisibility graph-based global encoding learning and data augmentation strategy, along with a depth-adjusted reprojection loss to facilitate implicit triangulation. Additionally, we revisit the network architecture and local feature extraction module. Our method achieves state-of-the-art results on challenging large-scale datasets without relying on network ensembles or 3D supervision. On Aachen Day-Night, we are 10$\times$ more accurate than previous SCR methods with similar map sizes and require at least 5$\times$ smaller map sizes than any other SCR method while still delivering superior accuracy. Code will be available at https://github.com/cvg/scrstudio.