🤖 AI Summary
This work addresses the challenge of accurate image-based localization in real-world complex environments, where existing methods rely on small-scale vectorized floorplans and struggle with ordinary images. The authors propose a novel approach that first reconstructs a gravity-aligned 3D scene from unconstrained input images to generate a 2D density map as a proxy for the floorplan, then aligns this map with the given floorplan via a 2D similarity transformation. By innovatively integrating 3D reconstruction with 2D foundation models, the method employs a fine-tuning strategy that enforces semantic alignment and structural consistency, enabling robust performance even with extremely sparse inputs—such as a single image. Experiments demonstrate that the proposed technique significantly outperforms state-of-the-art methods across diverse real-world scenarios, achieving reliable localization with minimal data.
📝 Abstract
Many public buildings provide floorplans with a "you are here" indicator to help visitors orient themselves. Floorplan localization seeks to computationally replicate this capability by determining where visual observations were captured within a floorplan. However, existing methods typically assume controlled small-scale environments and precise vectorized floorplans, limiting their ability to operate in large-scale buildings and rasterized floorplans. In this work, we present an approach for performing floorplan localization in the wild by grounding the task in a reconstructed 3D representation of the scene. Given an unconstrained image collection, our method reconstructs a gravity-aligned 3D scene and projects it into a 2D density map that serves as a floorplan proxy. Floorplan localization is then formulated as aligning this proxy with the input floorplan via a 2D similarity transform. To bridge the appearance gap between density maps and architectural floorplans, we adapt a 2D foundation model to learn cross-modal correspondences, introducing a fine-tuning scheme that encourages semantically aligned matches while preserving structural consistency. Extensive experiments demonstrate substantial improvements over prior methods, including in extremely sparse settings with as little as a single input image. Our code and data will be publicly available.