🤖 AI Summary
In the foundation model (FM) era, classical feature aggregation methods have been overlooked, and cross-dataset training remains inconsistent. To address these issues, this work revisits and enhances the GeM and NetVLAD paradigms. We propose a supervised label alignment framework for joint training across multiple Visual Place Recognition (VPR) datasets; introduce a dual-GeM architecture (G²M) to improve channel-wise feature calibration; and pioneer a two-stage fine-tuning strategy (FT²) for NetVLAD, along with a NetVLAD-Linear compression module. Experiments demonstrate that G²M achieves state-of-the-art performance at merely one-tenth the embedding dimension of recent methods; NVL-FT² ranks first on the MSLS leaderboard; and our methods consistently outperform existing FM-driven approaches across multiple benchmarks. Our core contribution lies in revitalizing classical aggregation methods, establishing an FM-compatible, efficient, and robust VPR paradigm grounded in principled feature aggregation.
📝 Abstract
Recent visual place recognition (VPR) approaches have leveraged foundation models (FMs) and introduced novel aggregation techniques. However, these methods fail to fully exploit key advantages of FMs, such as the effective use of extensive training sets, and they overlook the potential of classical aggregation methods such as GeM and NetVLAD. Building on these insights, we revive classical feature aggregation and develop more fundamental VPR models, collectively termed SuperPlace. First, we introduce a supervised label alignment method that enables training across various VPR datasets within a unified framework. Second, we propose G$^2$M, a compact feature aggregation method utilizing two GeMs, where one GeM learns the principal components of the feature maps along the channel dimension and calibrates the output of the other. Third, we propose the secondary fine-tuning (FT$^2$) strategy for NetVLAD-Linear (NVL), in which NetVLAD first learns feature vectors in a high-dimensional space and a single linear layer then compresses them into a lower-dimensional space. Extensive experiments highlight our contributions and demonstrate the superiority of SuperPlace. Specifically, G$^2$M achieves promising results with only one-tenth the feature dimensions of recent methods, and NVL-FT$^2$ ranks first on the MSLS leaderboard.
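To make the two aggregation ideas concrete, here is a minimal NumPy sketch. Everything below is our illustrative reading of the abstract, not the paper's exact formulation: the sigmoid gating in `g2m`, the pooling exponents `p_main`/`p_cal`, the softmax soft-assignment in `netvlad_linear`, and all function names are assumptions.

```python
import numpy as np

def gem(x, p=3.0, eps=1e-6):
    """Generalized-mean (GeM) pooling over local descriptors.
    x: (N, C) array of N local features with C channels.
    Returns a (C,) global descriptor: (mean(x^p))^(1/p)."""
    return np.mean(np.clip(x, eps, None) ** p, axis=0) ** (1.0 / p)

def g2m(x, p_main=3.0, p_cal=1.0):
    """Hypothetical sketch of the dual-GeM (G^2M) idea: one GeM pools the
    feature map, while a second GeM produces channel-wise statistics that
    calibrate (rescale) the first GeM's output. The centered-sigmoid gate
    is our assumption."""
    main = gem(x, p=p_main)
    cal = gem(x, p=p_cal)
    gate = 1.0 / (1.0 + np.exp(-(cal - cal.mean())))   # channel-wise gate
    desc = main * gate
    return desc / (np.linalg.norm(desc) + 1e-12)       # L2-normalize

def netvlad_linear(x, centroids, W):
    """Sketch of NetVLAD-Linear (NVL): soft-assign descriptors to K
    centroids, aggregate residuals (NetVLAD), then compress the flattened
    high-dimensional vector with a single linear layer W."""
    # Soft assignment via softmax over negative squared distances (N, K).
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    a = np.exp(-d2)
    a /= a.sum(axis=1, keepdims=True)
    # Residual aggregation into a (K, C) VLAD matrix.
    vlad = (a[:, :, None] * (x[:, None, :] - centroids[None, :, :])).sum(0)
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12  # intra-norm
    v = vlad.ravel()
    v /= np.linalg.norm(v) + 1e-12
    out = W @ v                                        # linear compression
    return out / (np.linalg.norm(out) + 1e-12)
```

In this reading, G$^2$M keeps the output at the backbone's channel dimension C (hence the small embeddings), while NVL first expands to K·C dimensions and relies on the learned linear layer, trained in the FT$^2$ stage, to recover a compact descriptor.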