Scale, Don't Fine-tune: Guiding Multimodal LLMs for Efficient Visual Place Recognition at Test-Time

📅 2025-09-02

🤖 AI Summary
Current visual place recognition (VPR) methods have improved semantic understanding but remain hindered by high computational overhead and poor cross-domain generalization. To address these limitations, we propose a zero-shot VPR framework that requires no fine-tuning. Our method combines Test-Time Scaling (TTS) and Uncertainty-Aware Self-Consistency (UASC) reasoning with guided structured prompting, which enforces JSON-formatted outputs, to exploit the image-text alignment capabilities of multimodal large language models (MLLMs). By bypassing conventional two-stage pipelines, it enables end-to-end matching with real-time adaptation. Evaluated under cross-domain settings, our approach achieves substantial accuracy improvements while delivering up to a 210× gain in computational efficiency over prior methods. It generalizes well across diverse environments and shows practical potential for real-time deployment.

📝 Abstract
Visual Place Recognition (VPR) has evolved from handcrafted descriptors to deep learning approaches, yet significant challenges remain. Current approaches, including Vision Foundation Models (VFMs) and Multimodal Large Language Models (MLLMs), enhance semantic understanding but suffer from high computational overhead and limited cross-domain transferability when fine-tuned. To address these limitations, we propose a novel zero-shot framework employing Test-Time Scaling (TTS) that leverages MLLMs' vision-language alignment capabilities through Guidance-based methods for direct similarity scoring. Our approach eliminates two-stage processing by employing structured prompts that generate length-controllable JSON outputs. The TTS framework with Uncertainty-Aware Self-Consistency (UASC) enables real-time adaptation without additional training costs, achieving superior generalization across diverse environments. Experimental results demonstrate significant improvements in cross-domain VPR performance with up to 210$\times$ computational efficiency gains.
Problem

Research questions and friction points this paper is trying to address.

High computational overhead in multimodal visual recognition systems
Limited cross-domain transferability of fine-tuned vision models
Inefficient two-stage processing in visual place recognition methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Test-Time Scaling for zero-shot VPR
Guidance-based similarity scoring method
Uncertainty-Aware Self-Consistency framework
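
The combination of JSON-formatted similarity scores and self-consistency over several sampled replies can be illustrated with a small sketch. This is not the paper's UASC procedure, whose exact formulation is not given on this page. It assumes each model reply is a JSON object with a hypothetical "score" field in [0, 1], and it uses a simple mean-minus-spread heuristic as a stand-in for uncertainty-aware aggregation. The function names and the penalty parameter are illustrative only.

```python
import json
import statistics
from typing import Dict, List, Optional, Tuple


def parse_score(raw: str) -> Optional[float]:
    """Parse one reply expected to look like {"score": 0.83} (assumed schema)."""
    try:
        value = json.loads(raw)["score"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None  # malformed or off-schema reply
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return None
    return float(value) if 0.0 <= value <= 1.0 else None


def aggregate(replies: List[str]) -> Optional[Tuple[float, float]]:
    """Return (mean score, population std dev) over the valid replies."""
    scores = [s for s in map(parse_score, replies) if s is not None]
    if not scores:
        return None
    return statistics.fmean(scores), statistics.pstdev(scores)


def rank_candidates(
    replies_by_candidate: Dict[str, List[str]], penalty: float = 1.0
) -> List[Tuple[str, float]]:
    """Rank candidates by mean score minus a spread penalty (a heuristic,
    not the paper's formula): disagreement between samples lowers the rank."""
    ranked = []
    for candidate, replies in replies_by_candidate.items():
        agg = aggregate(replies)
        if agg is None:
            continue  # no usable reply for this candidate
        mean, spread = agg
        ranked.append((candidate, mean - penalty * spread))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


if __name__ == "__main__":
    # Hypothetical sampled replies for two database images.
    samples = {
        "db_a": ['{"score": 0.9}', '{"score": 0.9}', "not json"],
        "db_b": ['{"score": 0.95}', '{"score": 0.35}'],
    }
    for name, value in rank_candidates(samples):
        print(name, round(value, 3))
```

In this sketch a candidate whose samples agree outranks one with a single high but inconsistent score, which is the intuition behind using sample disagreement as an uncertainty signal.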
Jintao Cheng
Hong Kong University of Science and Technology, Hong Kong, China
Weibin Li
South China Normal University, Shanwei, Guangdong, China
Jiehao Luo
South China Normal University
Computer Vision, 3D Perception
Xiaoyu Tang
South China Normal University, Shanwei, Guangdong, China
Zhijian He
Shenzhen Technology University, Shenzhen, Guangdong, China
Jin Wu
University of Science and Technology Beijing, Beijing, China
Yao Zou
University of Science and Technology Beijing, Beijing, China
Wei Zhang
Hong Kong University of Science and Technology, Hong Kong, China