🤖 AI Summary
Scene coordinate regression (SCR) loses localization accuracy in repetitive-texture and textureless regions, where its implicit triangulation breaks down. To address this, we propose a unified SCR framework with sequential enhancement. Our key contributions are: (1) an end-to-end architecture that jointly encodes scene features and detects salient keypoints, explicitly modeling scene saliency to suppress interference from non-informative regions; and (2) temporal modeling (via LSTM or GRU) in both the mapping and relocalization stages of SCR, enforcing inter-frame geometric consistency. The method supports two inference modes: the single-frame mode improves recall over the baseline by 6.4% while raising the running speed from 56 Hz to 90 Hz, and the sequence mode improves recall by 11% at the same computational efficiency. Evaluation on multi-scale indoor and outdoor benchmarks shows that the method outperforms state-of-the-art SCR methods.
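The summary names LSTM or GRU as the recurrent unit used for temporal modeling. As a rough illustration of how such a cell carries information from earlier frames into the current one, here is a minimal pure-Python GRU cell in the standard formulation of Cho et al. (2014). This is a sketch, not the paper's network: the per-frame feature vectors, dimensions, random initialization, and the `fuse_sequence` helper are all hypothetical, and a real system would use a deep learning framework (for example PyTorch's `torch.nn.GRU`).

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class GRUCell:
    """Minimal GRU cell (Cho et al. formulation); illustrative only."""

    def __init__(self, input_dim: int, hidden_dim: int, seed: int = 0, scale: float = 0.1):
        rng = random.Random(seed)

        def rand_matrix(rows: int, cols: int) -> Matrix:
            return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

        # Hypothetical random weights: input-to-hidden (W) and hidden-to-hidden (U)
        # for the update gate "z", reset gate "r", and candidate state "n".
        self.W = {g: rand_matrix(hidden_dim, input_dim) for g in "zrn"}
        self.U = {g: rand_matrix(hidden_dim, hidden_dim) for g in "zrn"}
        self.b = {g: [0.0] * hidden_dim for g in "zrn"}
        self.hidden_dim = hidden_dim

    def step(self, x: Vector, h: Vector) -> Vector:
        def affine(g: str, h_in: Vector) -> Vector:
            # W_g x + U_g h_in + b_g
            return [a + c + bb for a, c, bb in
                    zip(matvec(self.W[g], x), matvec(self.U[g], h_in), self.b[g])]

        z = [sigmoid(v) for v in affine("z", h)]  # update gate
        r = [sigmoid(v) for v in affine("r", h)]  # reset gate
        # Candidate state: the reset gate scales the previous state before U_n.
        n = [math.tanh(v) for v in affine("n", [ri * hi for ri, hi in zip(r, h)])]
        # New state interpolates between the previous state and the candidate.
        return [zi * hi + (1.0 - zi) * ni for zi, hi, ni in zip(z, h, n)]


def fuse_sequence(cell: GRUCell, frames: List[Vector]) -> List[Vector]:
    """Hypothetical helper: run the cell over per-frame features, return every state."""
    h = [0.0] * cell.hidden_dim
    states = []
    for x in frames:
        h = cell.step(x, h)
        states.append(h)
    return states
```

In a sequence mode like the one summarized here, a state such as `h` would be updated once per frame and consumed by the coordinate or pose stage; how the paper actually wires its recurrent state into mapping and relocalization is not specified in this summary.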
📝 Abstract
Scene Coordinate Regression (SCR) is a visual localization technique that uses deep neural networks (DNNs) to directly regress 2D-3D correspondences for camera pose estimation. However, because current SCR methods rely on implicit triangulation, they often struggle with repetitive textures and non-informative areas. In this paper, we propose an efficient and accurate SCR system. Unlike existing SCR methods, our system uses a unified architecture for both scene encoding and salient keypoint detection, which lets it prioritize the encoding of informative regions and significantly improves computational efficiency. Additionally, we introduce a mechanism that exploits sequential information during both mapping and relocalization, strengthening implicit triangulation, especially in environments with repetitive textures. Comprehensive experiments on indoor and outdoor datasets demonstrate that the proposed system outperforms state-of-the-art (SOTA) SCR methods. Our single-frame relocalization mode improves the recall rate of our baseline by 6.4% and increases the running speed from 56 Hz to 90 Hz. Furthermore, our sequence-based mode increases the recall rate by 11% while maintaining the original efficiency.
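One simple way to picture how a salient keypoint head lets an SCR system focus on informative regions is to keep only the most salient pixels as 2D-3D correspondences before pose estimation. The sketch below assumes the network produces a dense per-pixel saliency map alongside dense regressed scene coordinates; the function name, the top-k rule, and the `min_score` threshold are hypothetical, since the abstract does not describe the paper's actual selection mechanism.

```python
import heapq
from typing import List, Tuple

Pixel = Tuple[int, int]                 # (u, v) = (column, row)
Point3D = Tuple[float, float, float]    # regressed scene coordinate


def select_salient_correspondences(
    saliency: List[List[float]],
    scene_coords: List[List[Point3D]],
    k: int,
    min_score: float = 0.0,
) -> List[Tuple[Pixel, Point3D, float]]:
    """Hypothetical selection rule: keep the k most salient pixels above
    min_score and pair each with its regressed 3D scene coordinate."""
    candidates = (
        (score, (u, v))
        for v, row in enumerate(saliency)
        for u, score in enumerate(row)
        if score > min_score  # drop low-saliency (non-informative) pixels
    )
    # nlargest returns the candidates sorted by descending saliency.
    top = heapq.nlargest(k, candidates)
    return [((u, v), scene_coords[v][u], score) for score, (u, v) in top]


if __name__ == "__main__":
    sal = [[0.1, 0.9, 0.2],
           [0.8, 0.05, 0.7]]
    coords = [[(float(u), float(v), 1.0) for u in range(3)] for v in range(2)]
    for pixel, point, score in select_salient_correspondences(sal, coords, k=2):
        print(pixel, point, score)
```

Restricting the set of correspondences this way also shrinks the input to the downstream robust pose solver, which in SCR pipelines is typically PnP inside RANSAC (for example OpenCV's `cv2.solvePnPRansac`).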