Towards Implicit Aggregation: Robust Image Representation for Place Recognition in the Transformer Era

📅 2025-11-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
In Transformer-based visual place recognition (VPR), conventional approaches rely on explicit aggregation modules to fuse patch-level features into a global descriptor. This work challenges the necessity of such dedicated aggregators and proposes an **implicit aggregation paradigm**: learnable aggregation tokens are embedded directly into the backbone Transformer, co-participating with image patch tokens in self-attention computation; global feature integration emerges end-to-end through token-level interactions, with the final aggregation token serving as the compact global descriptor. We systematically investigate optimal insertion layers and initialization strategies for these tokens. Evaluated on multiple standard VPR benchmarks, our method surpasses state-of-the-art approaches in accuracy while achieving higher inference efficiency. It further attains top performance in the MSLS challenge.

📝 Abstract
Visual place recognition (VPR) is typically regarded as a specific image retrieval task whose core lies in representing images as global descriptors. Over the past decade, dominant VPR methods (e.g., NetVLAD) have followed a paradigm that first extracts patch features/tokens of the input image using a backbone and then aggregates these patch features into a global descriptor via an aggregator. This backbone-plus-aggregator paradigm achieved overwhelming dominance in the CNN era and remains widely used in transformer-based models. In this paper, however, we argue that a dedicated aggregator is not necessary in the transformer era: robust global descriptors can be obtained with the backbone alone. Specifically, we introduce a set of learnable aggregation tokens, which are prepended to the patch tokens before a particular transformer block. All these tokens are jointly processed and interact globally via the intrinsic self-attention mechanism, implicitly aggregating useful information from the patch tokens into the aggregation tokens. Finally, we take only these aggregation tokens from the last output tokens and concatenate them as the global representation. Although implicit aggregation can provide robust global descriptors in an extremely simple manner, where and how to insert the additional tokens, as well as how to initialize them, remain open issues worthy of further exploration. To this end, we also propose an optimal token insertion strategy and token initialization method derived from empirical studies. Experimental results show that our method outperforms state-of-the-art methods on several VPR datasets with higher efficiency and ranks 1st on the MSLS challenge leaderboard. The code is available at https://github.com/lu-feng/image.
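The mechanism described in the abstract can be sketched in a few lines: prepend learnable tokens to the patch tokens, let one or more transformer blocks attend over the joint sequence, then concatenate only the aggregation-token outputs as the descriptor. The sketch below (plain NumPy, not the authors' code) uses a bare single-head self-attention in place of a full ViT block; all dimensions and variable names are illustrative, and a real block would also include LayerNorm, an MLP, and residual connections.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a token sequence x of shape (T, D)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(x.shape[1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
D = 64            # token dimension (illustrative)
num_patches = 196 # e.g. a 14x14 patch grid
num_agg = 4       # number of learnable aggregation tokens (illustrative)

# Learnable aggregation tokens (would be an nn.Parameter in a real model)
# and patch tokens coming from the earlier backbone blocks.
agg_tokens = rng.normal(0.0, 0.02, (num_agg, D))
patch_tokens = rng.normal(size=(num_patches, D))

# Prepend the aggregation tokens before a chosen transformer block.
tokens = np.concatenate([agg_tokens, patch_tokens], axis=0)

# One remaining transformer block, reduced here to bare self-attention;
# the aggregation tokens interact globally with all patch tokens.
w_q, w_k, w_v = (rng.normal(0.0, 0.02, (D, D)) for _ in range(3))
out = self_attention(tokens, w_q, w_k, w_v)

# Global descriptor: concatenate only the aggregation-token outputs.
descriptor = out[:num_agg].reshape(-1)
print(descriptor.shape)  # (256,)
```

No explicit aggregator (NetVLAD, GeM, etc.) appears anywhere: the fusion of patch information into the descriptor happens entirely inside the self-attention computation.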
Problem

Research questions and friction points this paper is trying to address.

Eliminating dedicated aggregators for robust visual place recognition
Developing implicit aggregation via transformer tokens
Optimizing token insertion strategy and initialization method
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses learnable aggregation tokens in transformers
Implicitly aggregates information via self-attention
Proposes token insertion strategy and initialization method
Feng Lu
Tsinghua Shenzhen International Graduate School, Tsinghua University
Tong Jin
Shenyang Institute of Automation, Chinese Academy of Sciences
Canming Ye
Tsinghua Shenzhen International Graduate School, Tsinghua University
Yunpeng Liu
Wuhan University of Technology
Xiangyuan Lan
Pengcheng Laboratory
Chun Yuan
Tsinghua Shenzhen International Graduate School, Tsinghua University