🤖 AI Summary
To address the performance degradation that Visual Place Recognition (VPR) methods suffer when the test distribution differs significantly from the training distribution, this paper proposes a lightweight test-time fine-tuning method driven by the reference set. Prior to inference, the method performs a single-step adaptive optimization of a vision-foundation-model backbone using a small set of target-domain reference images with known poses. Crucially, it leverages the test-time map, i.e., the available reference imagery and its geometric metadata, as an implicit domain-adaptation signal, eliminating the need for additional annotations or architectural modifications. This bridges the domain gap while preserving the model's generalization ability. Evaluated on multiple challenging cross-domain benchmarks, the method improves Recall@1 over state-of-the-art methods by roughly 2.3% on average, enhancing their robustness and practical applicability.
📝 Abstract
Given a query image, Visual Place Recognition (VPR) is the task of retrieving an image of the same place from a reference database, with robustness to viewpoint and appearance changes. Recent works show that some VPR benchmarks are effectively solved by methods that use Vision-Foundation-Model backbones and are trained on large-scale, diverse VPR-specific datasets. However, several benchmarks remain challenging, particularly when the test environments differ significantly from the usual VPR training datasets. We propose a complementary, previously unexplored source of information to bridge the train-test domain gap, which can further improve the performance of State-of-the-Art (SOTA) VPR methods on such challenging benchmarks. Concretely, we observe that the test-time reference set, the "map", contains images and poses from the target domain, and in several VPR applications it must be available before the test-time query is received. We therefore propose simple Reference-Set-Finetuning (RSF) of VPR models on the map, boosting the SOTA (~2.3% average increase in Recall@1) on these challenging datasets. Finetuned models retain generalization, and RSF works across diverse test datasets.
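The core idea, finetuning on the map's images and poses before any query arrives, can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: the pose-distance threshold `pos_thresh`, the linear projection `W` standing in for the backbone, and the contrastive loss are all assumptions made for the example. It mines positive/negative reference pairs from pose distance (the map's free supervisory signal) and takes one gradient step on the reference set.

```python
import numpy as np

def mine_pairs(poses, pos_thresh=10.0):
    """Label every reference-image pair positive if the two poses are
    within pos_thresh metres (a hypothetical criterion for illustration;
    the paper's mining strategy may differ)."""
    n = len(poses)
    return [(i, j, np.linalg.norm(poses[i] - poses[j]) < pos_thresh)
            for i in range(n) for j in range(i + 1, n)]

def rsf_step(features, poses, W, lr=1e-3, margin=0.5, pos_thresh=10.0):
    """One fine-tuning step of a linear projection W on the reference set,
    using a simple contrastive loss over pose-mined pairs. A sketch of the
    reference-set-finetuning idea, not the paper's exact objective."""
    grad = np.zeros_like(W)
    pairs = mine_pairs(poses, pos_thresh)
    for i, j, is_pos in pairs:
        d = features[i] - features[j]     # raw feature difference
        z = W @ d                         # projected difference
        dist = np.linalg.norm(z) + 1e-12
        if is_pos:
            # positives: pull embeddings together (loss = dist**2)
            grad += 2.0 * np.outer(z, d)
        elif dist < margin:
            # negatives inside the margin: push embeddings apart
            grad -= 2.0 * (margin - dist) / dist * np.outer(z, d)
    return W - lr * grad / len(pairs)
```

In the paper the foundation-model backbone itself is finetuned rather than a single projection matrix; the point of the sketch is only that pose-labelled reference pairs already available in the map supply a target-domain training signal at no extra annotation cost.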