Sat2City: 3D City Generation from A Single Satellite Image with Cascaded Latent Diffusion

📅 2025-07-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address structural ambiguity and the lack of geometric detail in high-fidelity 3D urban scene generation from a single satellite image, this paper proposes a cascaded latent diffusion framework. First, multi-scale feature representation of sparse voxel grids is enhanced via a re-hashing operation applied at the bottleneck of a variational autoencoder. Second, an inverse sampling strategy is introduced to enable implicit supervision, improving 3D structural consistency and the smoothness of appearance transitions. Third, a large-scale synthetic 3D urban dataset is constructed to support training and evaluation. Experiments demonstrate significant improvements over state-of-the-art neural rendering approaches in both geometric accuracy and visual realism. The framework establishes a scalable, high-fidelity generative paradigm for digital twin construction and virtual urban modeling.
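The cascade described above can be pictured as two chained stages, each conditioned on the previous stage's output. The following is a minimal toy sketch of that control flow only, not the paper's implementation: `denoise_stage`, the latent shapes, and the blending rule are all illustrative stand-ins for a real diffusion sampler.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_stage(cond, latent_shape, steps=10):
    """Toy stand-in for one diffusion stage: starts from noise and
    iteratively moves the latent toward the conditioning signal.
    A real stage would run a learned denoiser instead."""
    z = rng.standard_normal(latent_shape)
    target = np.resize(cond, latent_shape)  # tile conditioning to latent shape
    for t in range(steps):
        alpha = (t + 1) / steps
        z = (1 - alpha) * z + alpha * target  # blend toward the condition
    return z

# Cascade: satellite height map -> coarse 3D latent -> refined 3D latent
height_map = rng.random((16, 16))              # single satellite-view input
coarse = denoise_stage(height_map, (8, 8, 8))  # stage 1: coarse structure
refined = denoise_stage(coarse, (16, 16, 16))  # stage 2: detail, conditioned on stage 1
print(refined.shape)  # (16, 16, 16)
```

The point of the cascade is that the second stage never sees the raw satellite image directly; it only refines the coarse 3D structure recovered by the first stage.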

📝 Abstract
Recent advancements in generative models have enabled 3D urban scene generation from satellite imagery, unlocking promising applications in gaming, digital twins, and beyond. However, most existing methods rely heavily on neural rendering techniques, which hinder their ability to produce detailed 3D structures on a broader scale, largely due to the inherent structural ambiguity derived from relatively limited 2D observations. To address this challenge, we propose Sat2City, a novel framework that synergizes the representational capacity of sparse voxel grids with latent diffusion models, tailored specifically for our novel 3D city dataset. Our approach is enabled by three key components: (1) a cascaded latent diffusion framework that progressively recovers 3D city structures from satellite imagery, (2) a Re-Hash operation at its Variational Autoencoder (VAE) bottleneck to compute multi-scale feature grids for stable appearance optimization, and (3) an inverse sampling strategy enabling implicit supervision for smooth appearance transitioning. To overcome the challenge of collecting real-world city-scale 3D models with high-quality geometry and appearance, we introduce a dataset of synthesized large-scale 3D cities paired with satellite-view height maps. Validated on this dataset, our framework generates detailed 3D structures from a single satellite image, achieving superior fidelity compared to existing city generation models.
Problem

Research questions and friction points this paper is trying to address.

Generate detailed 3D city structures from single satellite images
Overcome structural ambiguity in 2D-to-3D urban scene generation
Address lack of real-world 3D city datasets with high fidelity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cascaded latent diffusion for 3D recovery
Re-Hash VAE for multi-scale features
Inverse sampling for smooth appearance
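The Re-Hash bullet above refers to computing multi-scale feature grids at the VAE bottleneck; the paper's exact operation is not spelled out here. As a point of reference, a generic multi-resolution spatial-hash feature lookup (in the spirit of hash-grid encodings) can be sketched as below; the prime constants, table size, and feature dimension are illustrative assumptions, not Sat2City's values.

```python
import numpy as np

def hash_coords(coords, table_size):
    """Spatial hash of integer 3D cell coordinates via per-axis primes."""
    primes = np.array([1, 2654435761, 805459861], dtype=np.uint64)
    h = np.zeros(len(coords), dtype=np.uint64)
    for d in range(3):
        h ^= coords[:, d].astype(np.uint64) * primes[d]  # uint64 wraps on overflow
    return h % np.uint64(table_size)

def multiscale_features(coords, tables, resolutions):
    """Concatenate hashed features looked up at several grid resolutions."""
    feats = []
    for table, res in zip(tables, resolutions):
        cells = (coords * res).astype(np.int64)   # quantize points to this scale
        idx = hash_coords(cells, len(table))
        feats.append(table[idx])
    return np.concatenate(feats, axis=1)

rng = np.random.default_rng(0)
resolutions = [8, 16, 32]                                       # coarse -> fine grids
tables = [rng.standard_normal((1024, 4)) for _ in resolutions]  # 4-dim features per scale
pts = rng.random((5, 3))                                        # sparse voxel centers in [0,1]^3
f = multiscale_features(pts, tables, resolutions)
print(f.shape)  # (5, 12): 3 scales x 4 feature dims
```

Concatenating features from coarse and fine grids is what gives a single bottleneck lookup access to both global structure and local detail.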