TeTRA-VPR: A Ternary Transformer Approach for Compact Visual Place Recognition

📅 2025-03-04
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
For resource-constrained platforms such as UAVs and mobile robots, visual place recognition (VPR) faces challenges including large model size, high inference latency, and difficulty balancing accuracy and efficiency. This paper proposes a lightweight and efficient VPR method: the first joint compression paradigm integrating a progressive ternarized Vision Transformer (ViT) backbone with a binarized embedding layer, coupled with a representation-preserving progressive distillation strategy. This enables ultra-low-bit models (2-bit backbone + 1-bit embedding) to surpass full-precision CNN baselines in accuracy. Experiments show that, compared to state-of-the-art efficient baselines, the method reduces memory footprint by 69% and inference latency by 35% while maintaining or slightly improving Recall@1, achieving for the first time high-accuracy, low-overhead VPR deployment on edge devices.

πŸ“ Abstract
Visual Place Recognition (VPR) localizes a query image by matching it against a database of geo-tagged reference images, making it essential for navigation and mapping in robotics. Although Vision Transformer (ViT) solutions deliver high accuracy, their large models often exceed the memory and compute budgets of resource-constrained platforms such as drones and mobile robots. To address this issue, we propose TeTRA, a ternary transformer approach that progressively quantizes the ViT backbone to 2-bit precision and binarizes its final embedding layer, offering substantial reductions in model size and latency. A carefully designed progressive distillation strategy preserves the representational power of a full-precision teacher, allowing TeTRA to retain or even surpass the accuracy of uncompressed convolutional counterparts, despite using fewer resources. Experiments on standard VPR benchmarks demonstrate that TeTRA reduces memory consumption by up to 69% compared to efficient baselines, while lowering inference latency by 35%, with either no loss or a slight improvement in recall@1. These gains enable high-accuracy VPR on power-constrained, memory-limited robotic platforms, making TeTRA an appealing solution for real-world deployment.
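The abstract's core weight-level idea (a 2-bit ternary backbone with a per-tensor scale) can be sketched in a few lines of NumPy. This is an illustrative sketch only: the 0.7 threshold factor and the scaling rule below follow the common Ternary Weight Networks recipe, not necessarily TeTRA's exact quantizer.

```python
import numpy as np

def ternarize(w, threshold_factor=0.7):
    """Map a float weight tensor to alpha * {-1, 0, +1}.

    threshold_factor=0.7 is the Ternary Weight Networks heuristic;
    TeTRA's actual quantization scheme may differ.
    """
    delta = threshold_factor * np.mean(np.abs(w))
    t = np.sign(w) * (np.abs(w) > delta)  # ternary codes in {-1, 0, +1}
    nonzero = t != 0
    # Per-tensor scale: mean magnitude of the weights kept nonzero.
    alpha = np.abs(w[nonzero]).mean() if nonzero.any() else 0.0
    return alpha * t, alpha

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 8))      # a toy weight matrix
w_q, alpha = ternarize(w)            # quantized weights and their scale
```

Because each entry of `w_q / alpha` lies in {-1, 0, +1}, the weights can be stored in 2 bits each, which is the source of the memory savings claimed in the abstract.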
Problem

Research questions and friction points this paper is trying to address.

Reduces memory and compute requirements for Visual Place Recognition.
Enables high-accuracy VPR on resource-constrained robotic platforms.
Achieves significant model size and latency reduction without accuracy loss.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Ternary transformer reduces model size
Progressive distillation preserves accuracy
2-bit quantization lowers memory usage
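The binarized embedding layer mentioned above enables cheap retrieval: descriptors become bit vectors, so a database query reduces to Hamming-distance comparisons. A minimal sketch, assuming sign binarization and a toy 256-dimensional descriptor (both illustrative, not taken from the paper):

```python
import numpy as np

def binarize(x):
    """Sign-binarize a float descriptor to a {0, 1} bit vector."""
    return (x > 0).astype(np.uint8)

def hamming(a, b):
    """Hamming distance between two bit vectors."""
    return int(np.count_nonzero(a != b))

rng = np.random.default_rng(0)
query = rng.standard_normal(256)           # query embedding (float)
database = rng.standard_normal((5, 256))   # reference embeddings (float)

q_bits = binarize(query)
dists = [hamming(q_bits, binarize(d)) for d in database]
best = int(np.argmin(dists))               # Recall@1 candidate index
```

On real hardware the bit vectors would be packed into machine words so that Hamming distance is a handful of XOR and popcount instructions, which is where the latency reduction comes from.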
Oliver Grainge
School of Electronics and Computer Science, University of Southampton, United Kingdom
Michael Milford
QUT Professor | Director, QUT Robotics Centre | ARC Laureate Fellow | Microsoft Fellow
Robotics · computational neuroscience · navigation · SLAM · RatSLAM
I. Bodala
School of Electronics and Computer Science, University of Southampton, United Kingdom
Sarvapali D. Ramchurn
School of Electronics and Computer Science, University of Southampton, United Kingdom
Shoaib Ehsan
Assoc. Prof, University of Southampton | Reader, University of Essex | Co-I, Responsible AI UK
Computer Vision · Robotics · Embedded Systems · Responsible AI · Visual Place Recognition