🤖 AI Summary
This paper addresses the challenging problem of establishing pixel-level geometric correspondences between ground-level photographs and floor plans, a setting that combines cross-view (top-down vs. ground) and cross-modal (photorealistic images vs. abstract line drawings) discrepancies. To support this research, we introduce C3, a large-scale paired dataset comprising 90K image–floor-plan pairs and 153M pixel-level correspondence annotations, built by reconstructing scenes in 3D via structure-from-motion (SfM) and manually registering the reconstructions to floor plans collected from the Internet. On top of C3, we propose an end-to-end alignment method based on point-map prediction. Trained on our data, the method reduces root-mean-square error (RMSE) by 34% relative to the strongest baseline, substantially advancing the state of the art in cross-modal geometric matching.
📝 Abstract
Geometric models like DUSt3R have enabled great advances in understanding the geometry of a scene from pairs of photos. However, they fail when the inputs come from vastly different viewpoints (e.g., aerial vs. ground) or modalities (e.g., photos vs. abstract drawings) compared to what was observed during training. This paper addresses a challenging version of this problem: predicting correspondences between ground-level photos and floor plans. Current datasets for joint photo–floor plan reasoning are limited, either lacking varying modalities (VIGOR) or lacking correspondences (WAFFLE). To address these limitations, we introduce a new dataset, C3, created by first reconstructing a number of scenes in 3D from Internet photo collections via structure-from-motion, then manually registering the reconstructions to floor plans gathered from the Internet, from which we can derive correspondences between images and floor plans. C3 contains 90K paired floor plans and photos across 597 scenes, with 153M pixel-level correspondences and 85K camera poses. We find that state-of-the-art correspondence models struggle on this task. By training on our new data, we improve on the best-performing method by 34% in RMSE. We also identify open challenges in cross-modal geometric reasoning that our dataset aims to help address.
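The dataset-construction step above (deriving photo-to-floor-plan correspondences from an SfM reconstruction registered to a plan) can be illustrated with a minimal sketch. The idea: each SfM 3D point projects into a photo via its camera pose and intrinsics, and its horizontal position maps into floor-plan pixels via the registration transform. All function names are hypothetical, and the assumption that the manual registration is a 2D similarity transform (scale, rotation, translation) is ours, not stated in the paper:

```python
import numpy as np

def project_to_image(X_world, R, t, K):
    """Project a 3D SfM point into photo pixel coordinates
    using a standard pinhole camera model (X_cam = R X + t)."""
    X_cam = R @ X_world + t
    x = K @ X_cam
    return x[:2] / x[2]  # perspective divide

def project_to_floor_plan(X_world, s, theta, u0, v0):
    """Map the same point's horizontal (x, y) coordinates into
    floor-plan pixels via a 2D similarity transform, assumed to
    come from manually registering the reconstruction to the plan."""
    c, si = np.cos(theta), np.sin(theta)
    A = s * np.array([[c, -si], [si, c]])  # scaled rotation
    return A @ X_world[:2] + np.array([u0, v0])  # plus translation

# A correspondence pair for one point: (photo pixel, floor-plan pixel).
X = np.array([1.0, 2.0, 3.0])
K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
uv_photo = project_to_image(X, np.eye(3), np.zeros(3), K)
uv_plan = project_to_floor_plan(X, s=2.0, theta=0.0, u0=10.0, v0=20.0)
```

Repeating this over all reconstructed points visible in an image yields the kind of pixel-level image–floor-plan correspondences the dataset annotates.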