🤖 AI Summary
In the foundation model (FM) era, classical feature aggregation methods have been overlooked, and cross-dataset training remains inconsistent. To address these issues, this work revisits and enhances the GeM and NetVLAD paradigms. We propose a supervised label alignment framework for joint training across multiple Visual Place Recognition (VPR) datasets; introduce a dual-GeM architecture (G²M) to improve channel-wise feature calibration; and pioneer a two-stage fine-tuning strategy (FT²) for NetVLAD, along with a NetVLAD-Linear compression module. Experiments demonstrate that G²M achieves state-of-the-art performance at merely one-tenth the embedding dimension of recent methods; NVL-FT² ranks first on the MSLS leaderboard; and our methods consistently outperform existing FM-driven approaches across multiple benchmarks. Our core contribution lies in revitalizing classical aggregation methods, establishing an FM-compatible, efficient, and robust VPR paradigm grounded in principled feature aggregation.
📝 Abstract
Recent visual place recognition (VPR) approaches have leveraged foundation models (FMs) and introduced novel aggregation techniques. However, these methods fail to fully exploit key advantages of FMs, such as the effective use of extensive training sets, and they overlook the potential of classical aggregation methods such as GeM and NetVLAD. Building on these insights, we revive classical feature aggregation and develop more fundamental VPR models, collectively termed SuperPlace. First, we introduce a supervised label alignment method that enables training across various VPR datasets within a unified framework. Second, we propose G$^2$M, a compact feature aggregation method utilizing two GeMs, where one GeM learns the principal components of the feature maps along the channel dimension and calibrates the output of the other. Third, we propose the secondary fine-tuning (FT$^2$) strategy for NetVLAD-Linear (NVL), in which NetVLAD first learns feature vectors in a high-dimensional space and a single linear layer then compresses them into a lower-dimensional space. Extensive experiments highlight our contributions and demonstrate the superiority of SuperPlace. Specifically, G$^2$M achieves promising results with only one-tenth the feature dimensions of recent methods, and NVL-FT$^2$ ranks first on the MSLS leaderboard.
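To make the two aggregation ideas concrete, here is a minimal NumPy sketch. Everything below is our illustrative reading of the abstract, not the paper's exact formulation: the sigmoid gating in `g2m`, the pooling exponents `p_main`/`p_cal`, the softmax soft-assignment in `netvlad_linear`, and all function names are assumptions.

```python
import numpy as np

def gem(x, p=3.0, eps=1e-6):
    """Generalized-mean (GeM) pooling over local descriptors.
    x: (N, C) array of N local features with C channels.
    Returns a (C,) global descriptor: (mean(x^p))^(1/p)."""
    return np.mean(np.clip(x, eps, None) ** p, axis=0) ** (1.0 / p)

def g2m(x, p_main=3.0, p_cal=1.0):
    """Hypothetical sketch of the dual-GeM (G^2M) idea: one GeM pools the
    feature map, while a second GeM produces channel-wise statistics that
    calibrate (rescale) the first GeM's output. The centered-sigmoid gate
    is our assumption."""
    main = gem(x, p=p_main)
    cal = gem(x, p=p_cal)
    gate = 1.0 / (1.0 + np.exp(-(cal - cal.mean())))   # channel-wise gate
    desc = main * gate
    return desc / (np.linalg.norm(desc) + 1e-12)       # L2-normalize

def netvlad_linear(x, centroids, W):
    """Sketch of NetVLAD-Linear (NVL): soft-assign descriptors to K
    centroids, aggregate residuals (NetVLAD), then compress the flattened
    high-dimensional vector with a single linear layer W."""
    # Soft assignment via softmax over negative squared distances (N, K).
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    a = np.exp(-d2)
    a /= a.sum(axis=1, keepdims=True)
    # Residual aggregation into a (K, C) VLAD matrix.
    vlad = (a[:, :, None] * (x[:, None, :] - centroids[None, :, :])).sum(0)
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12  # intra-norm
    v = vlad.ravel()
    v /= np.linalg.norm(v) + 1e-12
    out = W @ v                                        # linear compression
    return out / (np.linalg.norm(out) + 1e-12)
```

In this reading, G$^2$M keeps the output at the backbone's channel dimension C (hence the small embeddings), while NVL first expands to K·C dimensions and relies on the learned linear layer, trained in the FT$^2$ stage, to recover a compact descriptor.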