LaVPR: Benchmarking Language and Vision for Place Recognition

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the severe performance degradation of visual place recognition (VPR) under extreme environmental changes and perceptual aliasing, as well as the challenge of localizing from natural-language descriptions alone. To this end, the authors introduce LaVPR, a large-scale benchmark that augments existing VPR datasets with over 650,000 natural-language descriptions, enabling study of both multimodal fusion and cross-modal retrieval. Fusing language with vision substantially improves robustness under visually degraded conditions and allows smaller models to approach the performance of much larger vision-only models. For purely language-driven localization, a baseline combining low-rank adaptation (LoRA) with a multi-similarity loss significantly outperforms standard contrastive learning approaches in cross-modal retrieval.
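To make the retrieval objective above concrete, here is a minimal PyTorch sketch of a multi-similarity loss with its standard pair mining, following the general formulation of Wang et al. (CVPR 2019). The hyperparameters and the in-batch mining scheme are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def multi_similarity_loss(embeddings, labels, alpha=2.0, beta=50.0, lam=0.5, eps=0.1):
    """Multi-similarity loss over a batch of (descriptor, place-ID) pairs.

    embeddings: (B, D) image or text descriptors.
    labels:     (B,) integer place IDs; equal IDs form positive pairs.
    """
    emb = F.normalize(embeddings, dim=1)
    sim = emb @ emb.t()                      # (B, B) cosine similarities
    losses = []
    for i in range(sim.size(0)):
        pos_mask = labels == labels[i]
        pos_mask[i] = False                  # drop the anchor itself
        pos, neg = sim[i][pos_mask], sim[i][labels != labels[i]]
        if pos.numel() == 0 or neg.numel() == 0:
            continue
        # Pair mining: keep negatives harder than the easiest positive
        # and positives harder than the easiest negative (margin eps).
        hard_neg = neg[neg + eps > pos.min()]
        hard_pos = pos[pos - eps < neg.max()]
        if hard_neg.numel() == 0 or hard_pos.numel() == 0:
            continue
        pos_term = torch.log1p(torch.exp(-alpha * (hard_pos - lam)).sum()) / alpha
        neg_term = torch.log1p(torch.exp(beta * (hard_neg - lam)).sum()) / beta
        losses.append(pos_term + neg_term)
    return torch.stack(losses).mean() if losses else sim.new_zeros(())
```

Both log-sum-exp terms are dominated by the hardest mined pairs, which is the usual motivation for preferring this loss over a plain contrastive objective.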

📝 Abstract
Visual Place Recognition (VPR) often fails under extreme environmental changes and perceptual aliasing. Furthermore, standard systems cannot perform "blind" localization from verbal descriptions alone, a capability needed for applications such as emergency response. To address these challenges, we introduce LaVPR, a large-scale benchmark that extends existing VPR datasets with over 650,000 rich natural-language descriptions. Using LaVPR, we investigate two paradigms: Multi-Modal Fusion for enhanced robustness and Cross-Modal Retrieval for language-based localization. Our results show that language descriptions yield consistent gains in visually degraded conditions, with the most significant impact on smaller backbones. Notably, adding language allows compact models to rival the performance of much larger vision-only architectures. For cross-modal retrieval, we establish a baseline using Low-Rank Adaptation (LoRA) and Multi-Similarity loss, which substantially outperforms standard contrastive methods across vision-language models. Ultimately, LaVPR enables a new class of localization systems that are both resilient to real-world stochasticity and practical for resource-constrained deployment. Our dataset and code are available at https://github.com/oferidan1/LaVPR.
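As a rough illustration of the cross-modal baseline, the sketch below LoRA-adapts a vision-language model for text-to-image place retrieval using Hugging Face `transformers` and `peft`. The checkpoint, LoRA rank, and target modules are illustrative assumptions; the paper's actual backbone and settings may differ.

```python
import torch
from transformers import CLIPModel, CLIPProcessor
from peft import LoraConfig, get_peft_model

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Inject low-rank adapters into the attention projections of both towers;
# the frozen base weights preserve the pretrained vision-language alignment.
lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()   # only the adapter weights are trainable

@torch.no_grad()
def text_to_image_scores(descriptions, images):
    """Rank database images for each natural-language place description."""
    inputs = processor(text=descriptions, images=images,
                       return_tensors="pt", padding=True)
    out = model(**inputs)
    # CLIP returns L2-normalized embeddings; renormalizing is a safeguard.
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    return txt @ img.t()   # cosine similarities: query texts x database images
```

The adapters would then be trained with a pair-based objective such as the multi-similarity loss sketched earlier, with place IDs defining positive text-image pairs.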
Problem

Research questions and friction points this paper is trying to address.

Visual Place Recognition
Environmental Changes
Perceptual Aliasing
Language-based Localization
Blind Localization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Language-Augmented VPR
Cross-Modal Retrieval
Multi-Modal Fusion
Low-Rank Adaptation (LoRA)
Visual Place Recognition