🤖 AI Summary
For image-based relocalization in large-scale scenes, existing methods rely on image retrieval or heuristic search, resulting in high computational complexity and substantial memory overhead. This paper introduces a novel paradigm, "visible structure retrieval", which employs a lightweight neural network to directly predict, in an end-to-end manner, the set of 3D structure points visible from a given input image. By restricting conventional 2D–3D matching to the physically visible subset of the map, the method eliminates the need for explicit image retrieval or hand-crafted search heuristics, jointly leveraging structured map priors and a differentiable matching mechanism. Evaluated on multiple large-scale benchmarks, the approach achieves localization accuracy comparable to the state of the art while substantially reducing the 2D–3D correspondence search space and memory consumption, improving both relocalization efficiency and scalability.
📝 Abstract
Accurate camera pose estimation from an image observation in a previously mapped environment is commonly done through structure-based methods: by finding correspondences between 2D keypoints in the image and 3D structure points in the map. To make this correspondence search tractable in large scenes, existing pipelines either rely on search heuristics or perform image retrieval to reduce the search space by comparing the current image to a database of past observations. However, these approaches result in elaborate pipelines or storage requirements that grow with the number of past observations. In this work, we propose a new paradigm for making structure-based relocalisation tractable. Instead of relying on image retrieval or search heuristics, we learn a direct mapping from image observations to the visible scene structure in a compact neural network. Given a query image, a forward pass through our novel visible structure retrieval network yields the subset of 3D structure points in the map that the image views, thus reducing the search space of 2D–3D correspondences. We show that our proposed method enables localisation with an accuracy comparable to the state of the art, while requiring a lower computational and storage footprint.
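The retrieve-then-match idea described in the abstract can be sketched as follows. This is a minimal illustration only: the random linear "network", the descriptor dimensions, and the top-k visibility selection are hypothetical stand-ins, not the paper's actual architecture or matching pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy map: N 3D structure points, each with a D-dim descriptor.
N, D = 10_000, 64
point_descs = rng.normal(size=(N, D)).astype(np.float32)

# Stand-in for the visible structure retrieval network: from a global image
# descriptor, predict a per-point visibility score. A fixed random linear
# layer replaces the learned network purely for illustration.
W = rng.normal(size=(D, N)).astype(np.float32) / np.sqrt(D)

def predict_visible(image_desc: np.ndarray, keep: int = 500) -> np.ndarray:
    """Return indices of the `keep` map points scored most visible."""
    scores = image_desc @ W            # (N,) visibility scores
    return np.argsort(scores)[-keep:]  # top-k visible subset

# Query: one global image descriptor and K local 2D keypoint descriptors.
image_desc = rng.normal(size=D).astype(np.float32)
keypoint_descs = rng.normal(size=(200, D)).astype(np.float32)

visible = predict_visible(image_desc)  # indices of candidate 3D points
subset = point_descs[visible]          # (500, D) instead of (N, D)

# 2D-3D matching restricted to the visible subset: nearest neighbour in
# descriptor space (a real pipeline would add ratio tests, RANSAC and PnP
# to estimate the camera pose from the resulting correspondences).
dists = np.linalg.norm(keypoint_descs[:, None, :] - subset[None, :, :], axis=-1)
matches = visible[np.argmin(dists, axis=1)]  # one 3D point index per keypoint

print(f"matched {len(matches)} keypoints against {len(visible)}/{N} map points")
```

The point of the sketch is the shape of the computation: one forward pass produces the candidate set, so matching cost scales with the predicted subset size rather than with the full map or with a database of past images.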