🤖 AI Summary
To improve the low pose estimation accuracy of image retrieval (IR)-based visual localization without retaining an explicit 3D reconstruction, this paper proposes a hybrid pose estimation framework that stores only image features. The method decouples translation and rotation estimation and localizes the query camera center using only relative translation estimates, not relative rotation estimates. It also computes the query pose directly from multiview correspondences rather than from estimated relative poses used as an intermediate representation. The pipeline integrates image feature indexing, relative translation estimation, multi-view geometric optimization, and a lightweight pose solver. Evaluated on the 7-Scenes and Cambridge Landmarks benchmarks, the approach shows improved localization performance along with lower run time and memory footprint compared to state-of-the-art methods.
📝 Abstract
The image retrieval (IR) approach to image localization has distinct advantages over the 3D and the deep learning (DNN) approaches: it is scene-agnostic, simpler to implement and use, has no privacy issues, and is computationally efficient. The main drawback of this approach is relatively poor localization of the query camera, in both position and orientation, compared to the competing approaches. This paper presents a hybrid approach that stores only image features in the database, like some IR methods, but relies on a latent 3D reconstruction, like 3D methods, without retaining a 3D scene reconstruction. The approach is based on two ideas: (i) a novel proposal in which query camera center estimation relies only on relative translation estimates and not on relative rotation estimates, through a decoupling of the two, and (ii) a shift from computing the optimal pose from estimated relative poses to computing the optimal pose from multiview correspondences, thus cutting out the "middle-man". Our approach shows improved performance on the 7-Scenes and Cambridge Landmarks datasets while also improving on timing and memory footprint compared to the state of the art.
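To give some intuition for idea (i), the following is a minimal sketch of how a camera center could be recovered from translation directions alone. It is not the paper's algorithm. It assumes that each database camera has a known center and that the relative translation direction toward the query has already been rotated into the world frame using the database camera's known orientation, so the query's own rotation never enters the computation. The function name `estimate_center_from_directions` and the synthetic data are hypothetical. The estimate is the point closest, in the least-squares sense, to all lines passing through the database centers along the given directions.

```python
import numpy as np


def estimate_center_from_directions(ref_centers, directions):
    """Least-squares intersection of the lines c_i + s * d_i.

    ref_centers: (N, 3) known database camera centers (world frame).
    directions:  (N, 3) translation directions toward the query camera,
                 assumed to be already expressed in the world frame.
    Requires at least two non-parallel directions.
    """
    ref_centers = np.asarray(ref_centers, dtype=float)
    directions = np.asarray(directions, dtype=float)
    # Only the direction matters, so normalize to unit length.
    directions = directions / np.linalg.norm(directions, axis=1, keepdims=True)

    A = np.zeros((3, 3))
    b = np.zeros(3)
    for c, d in zip(ref_centers, directions):
        # Projector onto the plane orthogonal to d: it measures the
        # component of (x - c) that lies off the line through c along d.
        P = np.eye(3) - np.outer(d, d)
        A += P
        b += P @ c
    # Normal equations of min_x sum_i ||P_i (x - c_i)||^2.
    return np.linalg.solve(A, b)


# Synthetic check (hypothetical data): noise-free directions from three
# database cameras toward a known query center.
true_center = np.array([1.0, 2.0, 0.5])
refs = np.array([[0.0, 0.0, 0.0], [4.0, 0.0, 0.0], [0.0, 5.0, 1.0]])
dirs = true_center - refs
estimate = estimate_center_from_directions(refs, dirs)
```

With noisy directions the same normal equations return a compromise point, which is one reason more than two database views are helpful. A robust variant would typically wrap this in an outlier-rejection loop, which is omitted here.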