AI Summary
To address the degradation of keypoint matching performance under large viewpoint or appearance changes in camera relocalization, this paper proposes a novel paradigm for robust image feature descriptor generation based on differentiable voxel rendering. Our method constructs a globally sparse yet locally dense 3D voxel map and synthesizes matchable descriptors for arbitrary viewpoints via voxel rendering. This work is the first to integrate voxel rendering into feature representation learning, unifying globally sparse storage with locally dense rendering while enabling cross-view descriptor synthesis, thereby relaxing the conventional requirement of viewpoint consistency in matching. Evaluated on the 7-Scenes and Cambridge Landmarks datasets, our approach reduces median translation error by up to 39% in indoor scenes, significantly outperforming state-of-the-art methods; it remains competitive in outdoor scenarios while incurring lower memory and computational overhead.
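As a rough illustration of the rendering step described above, the sketch below alpha-composites per-voxel descriptors along a camera ray using standard volume-rendering weights. It is a minimal NumPy sketch under assumed interfaces: `voxels.query`, the sampling bounds, and all other names are hypothetical, not the paper's implementation.

```python
import numpy as np

def render_descriptor(ray_o, ray_d, voxels, t_near=0.1, t_far=10.0, n_samples=64):
    """Alpha-composite per-voxel descriptors along one camera ray.

    `voxels` is a hypothetical sparse grid exposing
        voxels.query(points) -> (density, descriptor) per 3D point,
    returning zeros for points that fall outside any occupied voxel.
    """
    # Sample depths uniformly along the ray and lift them to 3D points.
    t = np.linspace(t_near, t_far, n_samples)
    points = ray_o[None, :] + t[:, None] * ray_d[None, :]   # (n_samples, 3)

    sigma, desc = voxels.query(points)                      # (n,), (n, D)

    # Standard volume-rendering weights: w_i = T_i * (1 - exp(-sigma_i * dt)).
    dt = t[1] - t[0]
    alpha = 1.0 - np.exp(-sigma * dt)                       # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))  # transmittance
    weights = trans * alpha

    # The rendered descriptor is the weighted sum of per-voxel descriptors;
    # renormalize so it stays comparable under cosine similarity.
    d = (weights[:, None] * desc).sum(axis=0)
    norm = np.linalg.norm(d)
    return d / norm if norm > 0 else d
```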
Abstract
Camera relocalization methods range from dense image alignment to direct camera pose regression from a query image. Among these, sparse feature matching stands out as an efficient, versatile, and generally lightweight approach with numerous applications. However, feature-based methods often struggle with significant viewpoint and appearance changes, leading to matching failures and inaccurate pose estimates. To overcome this limitation, we propose a novel approach that leverages a globally sparse yet locally dense 3D representation of 2D features. By tracking and triangulating landmarks over a sequence of frames, we construct a sparse voxel map optimized to render image patch descriptors observed during tracking. Given an initial pose estimate, we first synthesize descriptors from the voxels using volumetric rendering and then perform feature matching to estimate the camera pose. This methodology enables the generation of descriptors for unseen views, enhancing robustness to view changes. We extensively evaluate our method on the 7-Scenes and Cambridge Landmarks datasets. Our results show that our method significantly outperforms existing state-of-the-art feature representation techniques in indoor environments, achieving up to a 39% improvement in median translation error. Additionally, our approach yields comparable results to other methods for outdoor scenarios while maintaining lower memory and computational costs.
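To make the relocalization loop concrete, here is a small sketch of the match-then-solve step: descriptors are rendered from an initial pose estimate, matched to query keypoints by mutual nearest neighbor, and the pose is recovered with PnP + RANSAC. `voxel_map.render` is a hypothetical stand-in for the descriptor synthesis above, and `cv2.solvePnPRansac` is a standard OpenCV solver rather than necessarily what the authors use.

```python
import numpy as np
import cv2

def relocalize(query_kpts, query_desc, initial_pose, voxel_map, K):
    """Estimate the camera pose of a query image from rendered descriptors.

    query_kpts: (N, 2) keypoint pixel coordinates in the query image.
    query_desc: (N, D) L2-normalized query descriptors.
    voxel_map.render(pose): hypothetical call returning (points_3d, descriptors)
        for landmarks visible from `pose`, with L2-normalized descriptors.
    K: (3, 3) camera intrinsics.
    """
    points_3d, rendered_desc = voxel_map.render(initial_pose)

    # Mutual nearest-neighbor matching via cosine similarity.
    sim = query_desc @ rendered_desc.T
    nn12 = sim.argmax(axis=1)
    nn21 = sim.argmax(axis=0)
    mutual = nn21[nn12] == np.arange(len(query_desc))
    if mutual.sum() < 4:        # PnP needs at least 4 correspondences
        return None

    obj = points_3d[nn12[mutual]].astype(np.float64)
    img = query_kpts[mutual].astype(np.float64)

    # Robust pose estimation from the 2D-3D correspondences.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        obj, img, K, distCoeffs=None, reprojectionError=3.0)
    return (rvec, tvec) if ok else None
```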