AI Summary
This work addresses zero-shot, end-to-end 3D reconstruction of public buildings from minimal input: only a building name, address, or geographic coordinates. The proposed method automatically retrieves multi-view satellite and street-level imagery via Google Earth Studio, then uses SAM2 and GroundingDINO for text- or click-driven, zero-shot instance segmentation of the target building, with no task-specific training or annotations. To improve geometric fidelity, it pairs 2D Gaussian Splatting with segmentation masks refined by morphological operations and contour simplification, which improves boundary consistency and structural integrity. The core contribution is the integration of text-based or interactive segmentation with Gaussian rasterization-based 3D modeling, removing the reliance on labeled data. Experiments are reported to show robust performance in complex urban environments, producing architecturally plausible 3D meshes with high structural coherence and photorealistic textures.
Abstract
Recently released open-source, pre-trained foundation models for image segmentation and object detection (SAM2+GroundingDINO) allow for geometrically consistent segmentation of objects of interest in multi-view 2D images. Users can segment objects of interest with text-based or click-based prompts, without requiring labeled training datasets. Gaussian Splatting learns a 3D representation of a scene's geometry and radiance from 2D images. Combining Google Earth Studio, SAM2+GroundingDINO, 2D Gaussian Splatting, and our improvements in mask refinement based on morphological operations and contour simplification, we created a pipeline to extract the 3D mesh of any building based on its name, address, or geographic coordinates.
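The mask refinement step can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the 3x3 structuring element, the choice of morphological closing, the Douglas-Peucker simplifier, and all function names (`dilate`, `erode`, `close_mask`, `simplify_contour`) are assumptions made for illustration, since the text above names only "morphological operations and contour simplification". Contour tracing is omitted; the simplifier takes an already extracted, ordered list of boundary points.

```python
from math import hypot


def _neighbourhood(mask, y, x):
    """Yield the values in the 3x3 window around (y, x), clipped to the image."""
    h, w = len(mask), len(mask[0])
    for ny in range(max(0, y - 1), min(h, y + 2)):
        for nx in range(max(0, x - 1), min(w, x + 2)):
            yield mask[ny][nx]


def dilate(mask):
    """Binary dilation: a pixel becomes 1 if any pixel in its window is 1."""
    return [[int(any(_neighbourhood(mask, y, x))) for x in range(len(mask[0]))]
            for y in range(len(mask))]


def erode(mask):
    """Binary erosion: a pixel stays 1 only if every pixel in its window is 1.

    Pixels outside the image are ignored rather than treated as 0.
    """
    return [[int(all(_neighbourhood(mask, y, x))) for x in range(len(mask[0]))]
            for y in range(len(mask))]


def close_mask(mask):
    """Morphological closing (dilate, then erode): fills small holes and gaps."""
    return erode(dilate(mask))


def _point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return hypot(px - ax, py - ay)
    # Project p onto the segment and clamp the projection to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return hypot(px - (ax + t * dx), py - (ay + t * dy))


def simplify_contour(points, epsilon):
    """Douglas-Peucker simplification of an open polyline.

    Points closer than epsilon to the chord between the endpoints are dropped;
    otherwise the polyline is split at the farthest point and each half is
    simplified recursively.
    """
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    split, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _point_segment_distance(points[i], first, last)
        if d > dmax:
            split, dmax = i, d
    if dmax <= epsilon:
        return [first, last]
    left = simplify_contour(points[:split + 1], epsilon)
    right = simplify_contour(points[split:], epsilon)
    return left[:-1] + right  # drop the duplicated split point


# A 3x3 building footprint with a one-pixel hole in the middle.
noisy_mask = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 1, 0, 1, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
closed_mask = close_mask(noisy_mask)

# A jittery wall edge followed by a clean corner.
jagged = [(0, 0), (1, 0.05), (2, 0), (3, 0.04), (4, 0), (4, 1), (4, 2)]
simplified = simplify_contour(jagged, epsilon=0.1)
```

Closing fills the hole without growing the footprint, and the simplifier collapses the jittery edge to its corner points, which is the kind of boundary cleanup the refinement step aims for. In practice one would likely use an image library instead, for example OpenCV's `cv2.morphologyEx` and `cv2.approxPolyDP`; the kernel size and tolerance would need tuning per image resolution.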