D$^{2}$-VPR: A Parameter-efficient Visual-foundation-model-based Visual Place Recognition Method via Knowledge Distillation and Deformable Aggregation

📅 2025-11-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Visual place recognition (VPR) faces significant challenges in deploying large foundation models—e.g., DINOv2—on resource-constrained edge devices due to their excessive parameter count and computational overhead. To address this, we propose an efficient knowledge distillation and deformable aggregation framework. First, we design a two-stage distillation strategy augmented with a Distillation Recovery Module (DRM) to enhance feature-space alignment between teacher and student models. Second, we introduce a top-down attention-driven deformable aggregator (TDDA) that dynamically focuses on salient structural regions for robust place matching. Our approach preserves DINOv2’s strong representational capacity while drastically reducing model size and complexity. Experiments demonstrate that our method reduces parameters by 64.2% and floating-point operations by 62.6% compared to the state-of-the-art CricaVPR, with negligible degradation in VPR accuracy—thereby markedly improving feasibility for edge deployment.

📝 Abstract
Visual Place Recognition (VPR) aims to determine the geographic location of a query image by retrieving its most visually similar counterpart from a geo-tagged reference database. Recently, the emergence of the powerful visual foundation model, DINOv2, trained in a self-supervised manner on massive datasets, has significantly improved VPR performance. This improvement stems from DINOv2's exceptional feature generalization capabilities but is often accompanied by increased model complexity and computational overhead that impede deployment on resource-constrained devices. To address this challenge, we propose $D^{2}$-VPR, a $D$istillation- and $D$eformable-based framework that retains the strong feature extraction capabilities of visual foundation models while significantly reducing model parameters and achieving a more favorable performance-efficiency trade-off. Specifically, first, we employ a two-stage training strategy that integrates knowledge distillation and fine-tuning. Additionally, we introduce a Distillation Recovery Module (DRM) to better align the feature spaces between the teacher and student models, thereby minimizing knowledge transfer losses to the greatest extent possible. Second, we design a Top-Down-attention-based Deformable Aggregator (TDDA) that leverages global semantic features to dynamically and adaptively adjust the Regions of Interest (ROI) used for aggregation, thereby improving adaptability to irregular structures. Extensive experiments demonstrate that our method achieves competitive performance compared to state-of-the-art approaches. Meanwhile, it reduces the parameter count by approximately 64.2% and FLOPs by about 62.6% (compared to CricaVPR). Code is available at https://github.com/tony19980810/D2VPR.
Problem

Research questions and friction points this paper is trying to address.

Reducing model complexity of visual foundation models for efficient deployment
Minimizing knowledge transfer loss between teacher and student models
Improving adaptability to irregular structures through deformable aggregation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge distillation reduces model parameters
Deformable aggregation adaptively adjusts regions of interest
Two-stage training aligns teacher-student feature spaces
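The deformable-aggregation idea above can be sketched in a few lines: a global (top-down) descriptor predicts 2D offsets that deform a regular sampling grid over the feature map, and the features sampled at the deformed locations are pooled into one place descriptor. Everything here — map size, the mean-pooled global cue, the linear offset predictor, and nearest-neighbour sampling — is an assumed simplification of the paper's TDDA, which uses attention and (presumably) bilinear sampling.

```python
import numpy as np

rng = np.random.default_rng(1)

H, W, C, K = 8, 8, 32, 4   # assumed feature-map size, channels, sample points

feat_map = rng.standard_normal((H, W, C))

# Top-down cue: a global descriptor (here simply the mean over all spatial
# positions) predicts K 2D offsets from a fixed base grid.
global_desc = feat_map.mean(axis=(0, 1))          # (C,)
W_off = rng.standard_normal((C, K * 2)) * 0.1     # hypothetical offset head
offsets = (global_desc @ W_off).reshape(K, 2)

base = np.array([[2, 2], [2, 5], [5, 2], [5, 5]], dtype=float)
coords = base + offsets                            # deformed ROI locations

def sample(fmap, yx):
    # Nearest-neighbour sampling, clamped to the map bounds
    # (a differentiable version would use bilinear interpolation).
    y = int(np.clip(round(yx[0]), 0, H - 1))
    x = int(np.clip(round(yx[1]), 0, W - 1))
    return fmap[y, x]

# Aggregate the deformed samples into a single place descriptor.
descriptor = np.mean([sample(feat_map, c) for c in coords], axis=0)
print(descriptor.shape)
```

Because the offsets depend on the global descriptor rather than being fixed, the sampled regions can shift toward salient, irregular structures in each image, which is the adaptability the innovation list refers to.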
Zheyuan Zhang
School of Computer Science, Beijing University of Posts and Telecommunications; Key Laboratory of Interactive Technology and Experience System, Ministry of Culture and Tourism, Beijing University of Posts and Telecommunications
Jiwei Zhang
School of Computer Science, Beijing University of Posts and Telecommunications; Key Laboratory of Interactive Technology and Experience System, Ministry of Culture and Tourism, Beijing University of Posts and Telecommunications
Boyu Zhou
Assistant Professor, SUSTech
Robotics, aerial robots, active perception, mobile manipulation
Linzhimeng Duan
School of Computer Science, Beijing University of Posts and Telecommunications; Key Laboratory of Interactive Technology and Experience System, Ministry of Culture and Tourism, Beijing University of Posts and Telecommunications
Hong Chen
School of Computer Science, Beijing University of Posts and Telecommunications; Key Laboratory of Interactive Technology and Experience System, Ministry of Culture and Tourism, Beijing University of Posts and Telecommunications