Sat3DGen: Comprehensive Street-Level 3D Scene Generation from Single Satellite Image

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the generation of high-fidelity street-level 3D scenes from a single satellite image, a task where existing methods trade geometric accuracy against semantic diversity. The authors propose Sat3DGen, a geometry-first feed-forward framework that combines novel geometric constraints with a perspective-view training strategy to counter the two main sources of geometric error: the extreme satellite-to-street viewpoint gap and sparse, inconsistent supervision. It does so without additional tailored image-quality modules. On a new benchmark that pairs the VIGOR-OOD test set with high-resolution digital surface model (DSM) data, the method reduces geometric RMSE from 6.76 m to 5.20 m and lowers FID from approximately 40 to 19 relative to Sat2Density++. The resulting 3D assets support downstream applications including semantic-map-to-3D synthesis, multi-camera video generation, large-scale meshing, and unsupervised single-image DSM estimation.
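To make the geometric metric concrete, here is a minimal sketch of how an RMSE between a predicted height map and a reference DSM could be computed. The function names `dsm_rmse` and `relative_reduction` are hypothetical, and the masking of non-finite pixels is an assumption; the paper's actual evaluation protocol (alignment, resolution, masking) is not described in this summary.

```python
import numpy as np


def dsm_rmse(pred_height, ref_height, valid_mask=None):
    """RMSE in metres between a predicted height map and a reference DSM.

    Hypothetical helper: pixels that are non-finite in either map, or that
    are excluded by `valid_mask`, are ignored (an assumed convention).
    """
    pred = np.asarray(pred_height, dtype=np.float64)
    ref = np.asarray(ref_height, dtype=np.float64)
    if pred.shape != ref.shape:
        raise ValueError("height maps must have the same shape")

    finite = np.isfinite(pred) & np.isfinite(ref)
    if valid_mask is None:
        mask = finite
    else:
        mask = finite & np.asarray(valid_mask, dtype=bool)

    diff = pred[mask] - ref[mask]
    if diff.size == 0:
        raise ValueError("no valid pixels to compare")
    return float(np.sqrt(np.mean(diff ** 2)))


def relative_reduction(before, after):
    """Fractional reduction of an error metric, e.g. 0.25 for a 25% drop."""
    return (before - after) / before


if __name__ == "__main__":
    # Toy 2x2 example: every predicted height is 2 m below the reference.
    pred = np.zeros((2, 2))
    ref = np.full((2, 2), 2.0)
    print(dsm_rmse(pred, ref))
    # Relative change implied by the reported RMSE values.
    print(relative_reduction(6.76, 5.20))
```

Applied to the reported numbers, `relative_reduction(6.76, 5.20)` gives a relative RMSE reduction of roughly 23%.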

📝 Abstract

Generating a street-level 3D scene from a single satellite image is a crucial yet challenging task. Current methods present a stark trade-off: geometry-colorization models achieve high geometric fidelity but are typically building-focused and lack semantic diversity. In contrast, proxy-based models use feed-forward image-to-3D frameworks to generate holistic scenes by jointly learning geometry and texture, a process that yields rich content but coarse and unstable geometry. We attribute these geometric failures to the extreme viewpoint gap and the sparse, inconsistent supervision inherent in satellite-to-street data. To address these fundamental challenges, we introduce Sat3DGen, which embodies a geometry-first methodology. This methodology enhances the feed-forward paradigm by integrating novel geometric constraints with a perspective-view training strategy, explicitly countering the primary sources of geometric error. This geometry-centric strategy yields a dramatic leap in both 3D accuracy and photorealism. For validation, we first constructed a new benchmark by pairing the VIGOR-OOD test set with high-resolution DSM data. On this benchmark, our method improves geometric RMSE from 6.76 m to 5.20 m. Crucially, this geometric leap also boosts photorealism: the Fréchet Inception Distance (FID) drops from $\sim$40 for the leading method, Sat2Density++, to 19 for ours, despite using no extra tailored image-quality modules. We demonstrate the versatility of our high-quality 3D assets through diverse downstream applications, including semantic-map-to-3D synthesis, multi-camera video generation, large-scale meshing, and unsupervised single-image Digital Surface Model (DSM) estimation. The code has been released at https://github.com/qianmingduowan/Sat3DGen.
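For readers unfamiliar with the photorealism metric, the standard definition of the Fréchet Inception Distance is reproduced below. This assumes the paper uses the usual formulation; the symbols $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ are notation introduced here for the mean and covariance of Inception features computed over real and generated images, respectively. Lower values indicate that the two feature distributions are closer.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```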

Problem

Research questions and friction points this paper is trying to address.

3D scene generation

satellite-to-street

geometric fidelity

semantic diversity

viewpoint gap

Innovation

Methods, ideas, or system contributions that make the work stand out.

geometry-first

satellite-to-street 3D generation

perspective-view training