🤖 AI Summary
To address the high computational cost and limited robustness of existing descriptor-level fusion methods in multi-reference visual place recognition (VPR), particularly under appearance and viewpoint variations and in multi-sensor scenarios, this paper proposes a training-free, descriptor-agnostic matrix-decomposition framework. Our method jointly models multi-condition reference descriptors, decomposing them into a shared basis representation and condition-specific residual components to enable projection-based residual matching. This work is the first to introduce matrix decomposition into multi-reference VPR; it supports arbitrary pre-trained descriptors while keeping inference lightweight and generalisation strong. Evaluated on the structured multi-viewpoint SotonMV benchmark and on unstructured datasets, our approach achieves up to ~18% higher Recall@1 than single-reference baselines on multi-appearance data and ~5% gains over state-of-the-art multi-reference methods on unstructured data, substantially improving localisation robustness and practicality under complex appearance and viewpoint changes.
📝 Abstract
We address multi-reference visual place recognition (VPR), where reference sets captured under varying conditions are used to improve localisation performance. While deep learning with large-scale training improves robustness, the growing data diversity and model complexity incur substantial computational cost during training and deployment. Descriptor-level fusion via voting or aggregation avoids training, but often targets multi-sensor setups or relies on heuristics that yield limited gains under appearance and viewpoint change. We propose a training-free, descriptor-agnostic approach that jointly models places using multiple reference descriptors via matrix decomposition into basis representations, enabling projection-based residual matching. We also introduce SotonMV, a structured benchmark for multi-viewpoint VPR. On multi-appearance data, our method improves Recall@1 by up to ~18% over single-reference baselines and outperforms multi-reference baselines across appearance and viewpoint changes, with gains of ~5% on unstructured data, demonstrating strong generalisation while remaining lightweight.
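To make the projection-based residual matching idea concrete, below is a minimal sketch of how multiple reference descriptors for a place can be decomposed into a shared basis and how a query can then be scored by its residual after projection. The abstract does not specify the decomposition, so this sketch assumes a truncated SVD, a fixed rank, and residual norm as the matching score; the function names (`build_place_basis`, `residual_score`, `match_query`) are illustrative, not from the paper.

```python
# Hedged sketch: SVD-based basis per place + residual-norm matching.
# Assumptions (not confirmed by the paper): truncated SVD as the decomposition,
# a fixed subspace rank, and smaller residual norm meaning a better match.
import numpy as np

def build_place_basis(ref_descriptors: np.ndarray, rank: int) -> np.ndarray:
    """ref_descriptors: (D, K) matrix stacking K reference descriptors of one place.
    Returns an orthonormal basis (D, rank) spanning their shared subspace."""
    U, _, _ = np.linalg.svd(ref_descriptors, full_matrices=False)
    return U[:, :rank]

def residual_score(query: np.ndarray, basis: np.ndarray) -> float:
    """Norm of the query component not explained by the place basis
    (the projection residual); smaller means a closer match."""
    projection = basis @ (basis.T @ query)
    return float(np.linalg.norm(query - projection))

def match_query(query: np.ndarray, place_bases: list[np.ndarray]) -> int:
    """Return the index of the place whose subspace best explains the query."""
    scores = [residual_score(query, basis) for basis in place_bases]
    return int(np.argmin(scores))

# Toy usage: 3 places, 4 reference descriptors each, 256-D descriptors.
rng = np.random.default_rng(0)
places = [rng.standard_normal((256, 4)) for _ in range(3)]
bases = [build_place_basis(refs, rank=2) for refs in places]
query = places[1][:, 0] + 0.05 * rng.standard_normal(256)  # noisy view of place 1
print(match_query(query, bases))  # expected: 1
```

Because the per-place bases are computed once from pre-trained descriptors and matching reduces to a projection and a norm, this kind of scheme stays training-free and lightweight at inference, which is the property the abstract emphasises.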