Faster or Stronger: Towards Flexible Visual Place Recognition via Weighted Aggregation and Token Pruning

📅 2026-05-19

🤖 AI Summary

This work addresses two limitations of current visual place recognition (VPR) methods: feature aggregation that ignores how unequally clusters contribute, and the high computational cost of Vision Transformer (ViT)-based feature extraction. It proposes WeiAD (Weighted Aggregated Descriptor), which assigns weights to clusters during aggregation to produce more discriminative global descriptors, and WeiToP, a VPR-oriented token pruning framework built around a lightweight pruning module trained via self-distillation. After a single joint training phase, WeiToP allows the accuracy-efficiency trade-off to be adjusted at inference time without retraining. The authors report that WeiAD improves retrieval accuracy, that WeiToP reduces feature extraction latency, and that WeiToP outperforms general-purpose token pruning approaches across multiple benchmarks.
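To make the idea of cluster-weighted aggregation concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the softmax soft-assignment, the per-cluster L2 normalisation, and the names `weighted_aggregate`, `centers`, and `cluster_weights` are all assumptions chosen for illustration, and in a real model the centers and weights would be learned rather than supplied by hand.

```python
import numpy as np

def weighted_aggregate(tokens, centers, cluster_weights):
    """Pool patch tokens into clusters, then scale each cluster by a weight.

    tokens:          (N, D) patch features (hypothetical input)
    centers:         (K, D) cluster centers (assumed learned elsewhere)
    cluster_weights: (K,)   non-negative cluster weights (assumed learned elsewhere)
    Returns a unit-norm global descriptor of length K * D.
    """
    # Soft-assign every token to every cluster with a numerically stable softmax.
    logits = tokens @ centers.T                          # (N, K)
    logits = logits - logits.max(axis=1, keepdims=True)
    assign = np.exp(logits)
    assign = assign / assign.sum(axis=1, keepdims=True)  # each row sums to 1

    # Aggregate tokens per cluster and L2-normalise each cluster vector.
    pooled = assign.T @ tokens                           # (K, D)
    pooled = pooled / (np.linalg.norm(pooled, axis=1, keepdims=True) + 1e-12)

    # Unlike uniform pooling, scale each cluster by its normalised weight.
    w = cluster_weights / cluster_weights.sum()
    desc = (pooled * w[:, None]).reshape(-1)
    return desc / (np.linalg.norm(desc) + 1e-12)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(50, 8))
centers = rng.normal(size=(4, 8))
descriptor = weighted_aggregate(tokens, centers, np.array([1.0, 0.0, 1.0, 2.0]))
```

Setting a cluster's weight to zero removes its contribution from the descriptor entirely, which is the extreme case of the unequal contribution the summary describes.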

📝 Abstract

Visual Place Recognition (VPR) aims to match a query image to reference images of the same place in a large-scale database. Recent state-of-the-art methods employ Vision Transformers (ViTs) as backbone foundation models to extract patch-level features that are robust to viewpoint, illumination, and seasonal variations, which are then aggregated into a compact global descriptor for retrieval. Most existing aggregation methods uniformly pool patch tokens into learned clusters, despite the fact that different clusters often encode distinct spatial or semantic patterns and contribute unequally to VPR performance. To address this limitation, we propose Weighted Aggregated Descriptor (WeiAD), which assigns weights to clusters during aggregation, producing more discriminative global representations. Beyond accuracy, retrieval latency is a critical concern for large-scale deployments and resource-constrained edge devices. Prior work mainly reduces latency by compressing global descriptors, while overlooking the cost of feature extraction, an issue exacerbated by ViT-based backbones. We therefore introduce WeiToP, a VPR-oriented token pruning framework that reduces feature extraction cost via self-distillation, where aggregation-induced token importance supervises a lightweight pruning module attached to an early transformer layer, enabling inference-time token pruning. After a single joint training phase, WeiToP enables plug-and-play token pruning at inference time, allowing flexible and on-demand control over the accuracy-efficiency trade-off without additional training. Moreover, WeiToP outperforms existing token pruning methods adapted from general vision tasks.
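The inference-time control described in the abstract can be illustrated with a small top-k selection sketch. This is an assumption-laden illustration, not WeiToP itself: it presumes that some module has already produced one importance score per token, and the names `prune_tokens`, `scores`, and `keep_ratio` are hypothetical. The self-distillation training of the pruning module is not shown.

```python
import numpy as np

def prune_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring fraction of tokens, preserving their order.

    tokens:     (N, D) patch tokens at the layer where pruning happens
    scores:     (N,)   per-token importance (assumed to come from a pruning module)
    keep_ratio: float in (0, 1], chosen freely at inference time
    Returns the kept tokens and their original indices.
    """
    n = tokens.shape[0]
    k = max(1, int(round(n * keep_ratio)))      # always keep at least one token
    top = np.argsort(scores)[::-1][:k]          # indices of the k largest scores
    keep = np.sort(top)                         # restore original spatial order
    return tokens[keep], keep

tokens = np.arange(12, dtype=float).reshape(6, 2)
scores = np.array([0.1, 0.9, 0.3, 0.8, 0.2, 0.7])
kept, idx = prune_tokens(tokens, scores, keep_ratio=0.5)
```

Because `keep_ratio` is only an argument to the selection step, the same trained weights can be run at different ratios, which is the sense in which the trade-off needs no additional training in this sketch.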

Problem

Research questions and friction points this paper is trying to address.

Visual Place Recognition

Feature Aggregation

Token Pruning

Efficiency-Accuracy Trade-off

Vision Transformers

Innovation

Methods, ideas, or system contributions that make the work stand out.

Weighted Aggregation

Token Pruning

Vision Transformers