To Glue or Not to Glue? Classical vs Learned Image Matching for Mobile Mapping Cameras to Textured Semantic 3D Building Models

📅 2025-05-23
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study addresses visual localization between semantic 3D building models (CityGML LoD2) and mobile mapping imagery (ground- or UAV-captured). It presents the first systematic comparison of handcrafted features (SIFT) versus deep learning–based methods (SuperPoint, LoFTR) for facade texture matching. The authors propose a semantics-aware evaluation protocol and pose accuracy validation framework, uniformly assessing RANSAC inlier counts, area-under-curve (AUC) of matching recall, and PnP-derived pose errors across HPatches, MegaDepth-1500, and a newly constructed facade dataset. Results demonstrate that learned features substantially improve robustness and accuracy: on the facade dataset, LoFTR achieves 12 RANSAC inliers (versus zero for SIFT), an AUC of 0.16 (versus ≈0 for SIFT), and significantly reduced absolute pose errors. The framework enables rigorous, reproducible benchmarking of vision-based localization against semantic urban models.
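The AUC metric mentioned above can be sketched as follows. This is a minimal NumPy illustration of how an area-under-curve over per-image errors is commonly computed in the matching literature (recall as a function of the error threshold, integrated up to a maximum threshold and normalized); the paper's exact protocol may differ, and the function name `pose_auc` is an assumption, not the authors' code.

```python
import numpy as np

def pose_auc(errors, max_threshold):
    """AUC of the recall-vs-threshold curve, normalized to [0, 1].

    `errors` are per-image errors (e.g. pose error in degrees);
    recall(t) is the fraction of images with error <= t.
    Illustrative sketch only, not the paper's exact protocol.
    """
    errors = np.sort(np.asarray(errors, dtype=float))
    recall = np.arange(1, len(errors) + 1) / len(errors)
    keep = errors <= max_threshold
    # close the curve at 0 and at the maximum threshold
    e = np.concatenate(([0.0], errors[keep], [max_threshold]))
    r = np.concatenate(([0.0], recall[keep],
                        [recall[keep][-1] if keep.any() else 0.0]))
    # trapezoidal integration, normalized so a perfect method scores 1
    area = np.sum((r[1:] + r[:-1]) * np.diff(e)) / 2.0
    return area / max_threshold
```

A method whose errors all lie below the threshold scores 1.0; one whose errors all exceed it scores 0.0, which makes scores such as the reported 0.16-versus-≈0 gap directly comparable across methods.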

๐Ÿ“ Abstract
Feature matching is a necessary step for many computer vision and photogrammetry applications such as image registration, structure-from-motion, and visual localization. Classical handcrafted methods such as SIFT feature detection and description combined with nearest neighbour matching and RANSAC outlier removal have been state-of-the-art for mobile mapping cameras. With recent advances in deep learning, learnable methods have been introduced and proven to have better robustness and performance under complex conditions. Despite their growing adoption, a comprehensive comparison between classical and learnable feature matching methods for the specific task of semantic 3D building camera-to-model matching is still missing. This work systematically evaluates the effectiveness of different feature-matching techniques in visual localization using textured CityGML LoD2 models. We use standard benchmark datasets (HPatches, MegaDepth-1500) and custom datasets consisting of facade textures and corresponding camera images (terrestrial and drone). For the latter, we evaluate the achievable accuracy of the absolute pose estimated using a Perspective-n-Point (PnP) algorithm, with geometric ground truth derived from geo-referenced trajectory data. The results indicate that the learnable feature matching methods vastly outperform traditional approaches regarding accuracy and robustness on our challenging custom datasets, improving from zero to 12 RANSAC inliers and from near-zero to 0.16 area under the curve. We believe that this work will foster the development of model-based visual localization methods. Link to the code: https://github.com/simBauer/To_Glue_or_not_to_Glue
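The classical baseline the abstract describes starts with nearest-neighbour descriptor matching under Lowe's ratio test, before RANSAC outlier removal and PnP. The sketch below illustrates only that matching step with plain NumPy; the function name `match_descriptors` and the ratio value are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def match_descriptors(desc_a, desc_b, ratio=0.8):
    """Nearest-neighbour matching with Lowe's ratio test.

    Minimal sketch of the classical matching step (e.g. on SIFT
    descriptors); real pipelines follow this with RANSAC-based
    outlier removal before pose estimation. Descriptors are
    (N, D) float arrays; desc_b needs at least two rows.
    """
    # pairwise Euclidean distances between all descriptor pairs
    d = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    matches = []
    for i in range(d.shape[0]):
        order = np.argsort(d[i])
        best, second = order[0], order[1]
        # keep a match only if the best distance is clearly
        # smaller than the second-best (Lowe's ratio test)
        if d[i, best] < ratio * d[i, second]:
            matches.append((i, best))
    return matches
```

In practice one would obtain the descriptors from a detector such as SIFT and pass the surviving correspondences to a RANSAC-based geometric verification, which yields the inlier counts reported in the results.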
Problem

Research questions and friction points this paper is trying to address.

No systematic comparison exists between classical and learned image matching for camera-to-model matching with semantic 3D building models
Unclear how accurately visual localization against textured CityGML LoD2 models can be achieved with current feature matchers
Robustness of matching methods on challenging facade textures, beyond standard benchmarks, is largely untested
Innovation

Methods, ideas, or system contributions that make the work stand out.

First systematic comparison of handcrafted (SIFT) and learned (SuperPoint, LoFTR) matching for facade texture matching
Semantics-aware evaluation protocol with PnP pose validation against geo-referenced trajectory ground truth
Shows learned matchers vastly outperform SIFT on the custom facade dataset (12 vs. zero RANSAC inliers; 0.16 vs. ≈0 AUC)
Simone Gaisbauer
Professorship of Photogrammetry and Remote Sensing, TUM School of Engineering and Design, Technical University of Munich, 80333 Munich, Germany
Prabin Gyawali
Professorship of Photogrammetry and Remote Sensing, TUM School of Engineering and Design, Technical University of Munich, 80333 Munich, Germany
Qilin Zhang
Professorship of Photogrammetry and Remote Sensing, TUM School of Engineering and Design, Technical University of Munich, 80333 Munich, Germany
Olaf Wysocki
Assistant Research Professor, University of Cambridge
Computer Vision, Photogrammetry, Machine Learning
Boris Jutzi
Technical University of Munich (TUM) / Karlsruhe Institute of Technology (KIT)
Active Optical Sensor, 3D Computer Vision, Laser Scanning, Remote Sensing, Signal & Image Processing