🤖 AI Summary
This work addresses the fragmented or erroneously fused geometry that existing methods commonly produce when reconstructing large-scale real-world scenes from unstructured, in-the-wild images with little or no overlap. To achieve global consistency, the authors propose a framework that grounds each partial reconstruction to a complete, geospatially accurate pseudo-synthetic reference model built from Google Earth Studio renderings. The reference model is represented with 3D Gaussian Splatting, with each Gaussian augmented by semantic features. Alignment is formulated as an inverse feature-based optimization that estimates a global 6DoF pose and scale for the partial reconstruction while the reference model stays fixed. To support this task, the authors also introduce the WikiEarth dataset, which registers existing partial 3D reconstructions with pseudo-synthetic reference models. The authors report that the approach consistently improves global alignment when initialized with various classical and learning-based reconstruction pipelines, and that it mitigates failure modes of state-of-the-art end-to-end models.
📝 Abstract
Reconstructing accurate 3D models of large-scale real-world scenes from unstructured, in-the-wild imagery remains a core challenge in computer vision, especially when the input views have little or no overlap. In such cases, existing reconstruction pipelines often produce multiple disconnected partial reconstructions or erroneously merge non-overlapping regions into overlapping geometry. In this work, we propose a framework that grounds each partial reconstruction to a complete reference model of the scene, enabling globally consistent alignment even in the absence of visual overlap. We obtain reference models from dense, geospatially accurate pseudo-synthetic renderings derived from Google Earth Studio. These renderings provide full scene coverage but differ substantially in appearance from real-world photographs. Our key insight is that, despite this significant domain gap, both domains share the same underlying scene semantics. We represent the reference model using 3D Gaussian Splatting, augmenting each Gaussian with semantic features, and formulate alignment as an inverse feature-based optimization scheme that estimates a global 6DoF pose and scale while keeping the reference model fixed. Furthermore, we introduce the WikiEarth dataset, which registers existing partial 3D reconstructions with pseudo-synthetic reference models. We demonstrate that our approach consistently improves global alignment when initialized with various classical and learning-based pipelines, while mitigating failure modes of state-of-the-art end-to-end models. All code and data will be released.
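The unknown described above, a global 6DoF pose plus scale, amounts to a 7-parameter similarity transform applied to each partial reconstruction. The following is a minimal sketch of that parameterization, assuming point clouds stored as NumPy arrays and an axis-angle rotation. The function names (`rodrigues`, `apply_sim3`, `umeyama`) are hypothetical and do not come from the paper. The closed-form Umeyama solve from known point correspondences is a stand-in used only to show that the seven parameters are recoverable; it is not the paper's inverse feature-based optimization, which the abstract does not specify in detail.

```python
import numpy as np

def rodrigues(rotvec):
    """Convert an axis-angle vector of shape (3,) to a 3x3 rotation matrix."""
    theta = np.linalg.norm(rotvec)
    if theta < 1e-12:
        return np.eye(3)
    k = rotvec / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def apply_sim3(points, scale, rotvec, translation):
    """Map (N, 3) points through x -> scale * R @ x + translation."""
    R = rodrigues(rotvec)
    return scale * points @ R.T + translation

def umeyama(src, dst):
    """Closed-form least-squares fit of dst ~ s * R @ src + t.

    Assumes row i of src corresponds to row i of dst (a strong assumption
    that the paper's setting does not provide).
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    # Flip the last axis if needed so that R is a proper rotation.
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0
    R = U @ S @ Vt
    var_s = (xs ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    partial = rng.normal(size=(100, 3))        # toy "partial reconstruction"
    rotvec = np.array([0.1, -0.2, 0.3])
    reference = apply_sim3(partial, 2.5, rotvec, np.array([1.0, 2.0, 3.0]))
    s, R, t = umeyama(partial, reference)
    print("scale:", s)
    print("translation:", t)
```

In the setting the abstract describes, explicit point correspondences across the real and pseudo-synthetic domains are not assumed to exist, which is why the alignment is posed as an optimization over semantic features rather than solved in closed form as in this toy example.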