🤖 AI Summary
To address the trade-off between efficiency and accuracy in visual localization on edge devices, this paper proposes PRAM, a framework that leverages self-supervised 3D landmark generation, requiring no semantic labels, to replace redundant global and repetitive local descriptors with sparse, geometrically meaningful keypoints. These keypoints drive a lightweight Transformer for landmark recognition and landmark-wise 2D-3D matching, combined with label-based outlier rejection. The core innovation is the landmark-centric paradigm: abandoning pixel-level and scene-level matching in favor of sparse, geometry-consistent, landmark-driven correspondence. Experiments on large-scale indoor and outdoor scenes show that PRAM matches hierarchical methods in accuracy while significantly outperforming absolute pose regression (APR) and scene coordinate regression (SCR) approaches. It reduces memory footprint by over 90% and accelerates inference by 2.4×, achieving a markedly better balance between high accuracy and high efficiency on edge devices.
📝 Abstract
Visual localization is a key technique for a variety of applications, e.g., autonomous driving, AR/VR, and robotics. For these real-world applications, both efficiency and accuracy are important, especially on edge devices with limited computing resources. However, previous frameworks, e.g., absolute pose regression (APR), scene coordinate regression (SCR), and the hierarchical method (HM), are limited in either accuracy or efficiency in both indoor and outdoor environments. In this paper, we propose the place recognition anywhere model (PRAM), a new framework that performs visual localization efficiently and accurately by recognizing 3D landmarks. Specifically, PRAM first generates landmarks directly in 3D space in a self-supervised manner. Because these 3D landmarks do not rely on commonly used classic semantic labels, they can be defined anywhere in indoor and outdoor scenes, giving higher generalization ability. By representing the map with 3D landmarks, PRAM discards global descriptors, repetitive local descriptors, and redundant 3D points, significantly increasing memory efficiency. Then, sparse keypoints, rather than dense pixels, are used as input tokens to a transformer-based recognition module, which enables PRAM to recognize hundreds of landmarks with high time and memory efficiency. At test time, sparse keypoints and predicted landmark labels are used for outlier removal and landmark-wise 2D-3D matching, as opposed to exhaustive 2D-2D matching, which further increases time efficiency. A comprehensive evaluation of APRs, SCRs, HMs, and PRAM on both indoor and outdoor datasets demonstrates that PRAM outperforms APRs and SCRs in large-scale scenes by a large margin and achieves accuracy competitive with HMs, while reducing memory cost by over 90% and running 2.4 times faster, leading to a better balance between efficiency and accuracy.
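The landmark-wise matching step described above can be illustrated with a minimal sketch: each detected keypoint carries a predicted landmark label, keypoints labeled as outliers are dropped, and the remaining keypoints are matched only against the 3D points belonging to their own landmark rather than the whole map. The function name, similarity threshold, and data layout below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def landmark_wise_match(kpt_desc, kpt_labels, landmark_db, sim_thresh=0.8):
    """Match 2D keypoints to 3D points restricted to each keypoint's
    predicted landmark, instead of searching the entire map.

    kpt_desc    : (N, D) L2-normalized local descriptors of detected keypoints
    kpt_labels  : (N,) predicted landmark id per keypoint (-1 = outlier)
    landmark_db : dict mapping landmark id -> (descs (M, D), points3d (M, 3))
    Returns a list of (keypoint index, 3D point) correspondences.
    """
    matches = []
    for i, (desc, label) in enumerate(zip(kpt_desc, kpt_labels)):
        if label < 0 or label not in landmark_db:
            continue  # predicted outlier: rejected before matching
        db_desc, db_pts = landmark_db[label]
        sims = db_desc @ desc          # cosine similarities within one landmark
        j = int(np.argmax(sims))       # best 3D point of this landmark only
        if sims[j] >= sim_thresh:
            matches.append((i, db_pts[j]))
    return matches
```

The resulting 2D-3D correspondences would then feed a standard PnP + RANSAC pose solver; because each keypoint is compared against one landmark's points instead of all map points, the search cost shrinks from O(N · total points) to O(N · points per landmark).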