DC-VLAQ: Query-Residual Aggregation for Robust Visual Place Recognition

📅 2026-01-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limited robustness of global representations for visual place recognition under viewpoint variations, illumination changes, and domain shifts. To this end, the paper leverages the complementary strengths of the DINOv2 and CLIP vision foundation models. A lightweight residual-guided fusion strategy anchors representations in the DINOv2 feature space while injecting complementary CLIP semantics, and the Vector of Local Aggregated Queries (VLAQ), a query-residual global aggregation mechanism, encodes local tokens by their residual responses to learnable query vectors. This design preserves fine-grained discriminative details while improving representation stability. Extensive experiments demonstrate state-of-the-art performance on multiple standard VPR benchmarks, including Pitts30k, Tokyo24/7, and MSLS, with particularly strong robustness under domain shift and long-term appearance variations.
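The residual-guided fusion described above can be pictured with a short PyTorch sketch. This is a hypothetical illustration, not the authors' implementation: the class name `ResidualGuidedFusion`, the feature dimensions, the bottleneck width, and the assumption that DINOv2 and CLIP yield spatially aligned token grids are all choices made here for clarity.

```python
# Minimal sketch of residual-guided fusion of DINOv2 and CLIP tokens.
# Hypothetical: dimensions, the projection, and the bottleneck MLP are assumptions,
# not the paper's exact design. Assumes both backbones emit N aligned tokens.
import torch
import torch.nn as nn

class ResidualGuidedFusion(nn.Module):
    def __init__(self, dino_dim=768, clip_dim=1024, hidden=256):
        super().__init__()
        # Project CLIP tokens into the DINOv2 feature space.
        self.clip_proj = nn.Linear(clip_dim, dino_dim)
        # Lightweight bottleneck that predicts a residual correction.
        self.residual_mlp = nn.Sequential(
            nn.Linear(2 * dino_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, dino_dim),
        )

    def forward(self, dino_tokens, clip_tokens):
        # dino_tokens: (B, N, dino_dim); clip_tokens: (B, N, clip_dim)
        clip_aligned = self.clip_proj(clip_tokens)
        # Residual correction conditioned on both streams;
        # the DINOv2 tokens remain the anchor representation.
        residual = self.residual_mlp(torch.cat([dino_tokens, clip_aligned], dim=-1))
        return dino_tokens + residual
```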

📝 Abstract
One of the central challenges in visual place recognition (VPR) is learning a robust global representation that remains discriminative under large viewpoint changes, illumination variations, and severe domain shifts. While visual foundation models (VFMs) provide strong local features, most existing methods rely on a single model, overlooking the complementary cues offered by different VFMs. However, exploiting such complementary information inevitably alters token distributions, which challenges the stability of existing query-based global aggregation schemes. To address these challenges, we propose DC-VLAQ, a representation-centric framework that integrates the fusion of complementary VFMs and robust global aggregation. Specifically, we first introduce a lightweight residual-guided complementary fusion that anchors representations in the DINOv2 feature space while injecting complementary semantics from CLIP through a learned residual correction. In addition, we propose the Vector of Local Aggregated Queries (VLAQ), a query-residual global aggregation scheme that encodes local tokens by their residual responses to learnable queries, resulting in improved stability and the preservation of fine-grained discriminative cues. Extensive experiments on standard VPR benchmarks, including Pitts30k, Tokyo24/7, MSLS, Nordland, SPED, and AmsterTime, demonstrate that DC-VLAQ consistently outperforms strong baselines and achieves state-of-the-art performance, particularly under challenging domain shifts and long-term appearance changes.
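The VLAQ query-residual aggregation in the abstract can likewise be sketched in PyTorch. The sketch below is one interpretation under stated assumptions: a set of learnable query vectors, softmax soft assignment of tokens to queries, and a NetVLAD-style weighted sum of token-to-query residuals with intra- and final L2 normalization. The class name `QueryResidualAggregation` and the normalization choices are hypothetical, not taken from the paper.

```python
# Minimal sketch of query-residual global aggregation in the spirit of VLAQ.
# Hypothetical: soft-assignment weighting and normalization are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryResidualAggregation(nn.Module):
    def __init__(self, dim=768, num_queries=64):
        super().__init__()
        # Learnable query vectors acting as aggregation anchors.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)

    def forward(self, tokens):
        # tokens: (B, N, dim) fused local tokens.
        # Soft assignment of each token to each query.
        logits = tokens @ self.queries.t()                    # (B, N, Q)
        assign = logits.softmax(dim=-1)                       # (B, N, Q)
        # Residual response of every token with respect to every query.
        residuals = tokens.unsqueeze(2) - self.queries        # (B, N, Q, dim)
        # Weighted sum of residuals per query, then flatten to a global descriptor.
        agg = (assign.unsqueeze(-1) * residuals).sum(dim=1)   # (B, Q, dim)
        agg = F.normalize(agg, dim=-1)                        # intra-normalization
        return F.normalize(agg.flatten(1), dim=-1)            # (B, Q*dim)
```

If the two sketches were chained, the fused tokens from the residual-guided fusion would feed directly into this aggregator to produce the global descriptor used for retrieval.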
Problem

Research questions and friction points this paper is trying to address.

visual place recognition
robust representation
domain shift
viewpoint change
illumination variation
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual place recognition
visual foundation models
complementary fusion
residual-guided aggregation
query-based global representation