Extend3D: Town-Scale 3D Generation

📅 2026-03-31
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a limitation of existing object-centric 3D generative models, which struggle to represent large-scale urban scenes due to their fixed latent space dimensions. The authors propose a training-free, scene-level 3D generation method that extends and tiles the latent space along the x and y axes, initializes the scene with a point-cloud prior from monocular depth estimation, and iteratively refines occluded regions via an "under-noising" strategy combined with SDEdit. A key innovation lies in coupling latent-space expansion with a tiling mechanism, guided by 3D-aware optimization objectives that keep the denoising trajectories consistent across sub-scenes. Experimental results show that the proposed approach outperforms prior methods in both human preference studies and quantitative metrics, achieving superior geometric structure and texture fidelity in large-scale scene generation.
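The summary describes tiling an extended latent into overlapping patches and coupling the per-patch denoising at each step. A minimal sketch of one such coupled step, assuming a MultiDiffusion-style averaging of overlapping patch predictions (patch size, stride, and the `denoise_step` callback are illustrative assumptions, not the paper's actual implementation):

```python
import numpy as np

def couple_patches(latent, patch, stride, denoise_step):
    """One coupled denoising step over an extended (H, W, C) latent:
    tile it into overlapping patches, run the object-centric model's
    step on each patch, and average the predictions in overlap regions."""
    H, W, C = latent.shape
    acc = np.zeros_like(latent)           # accumulated patch predictions
    weight = np.zeros((H, W, 1))          # how many patches cover each cell
    for y in range(0, H - patch + 1, stride):
        for x in range(0, W - patch + 1, stride):
            sub = latent[y:y + patch, x:x + patch]
            out = denoise_step(sub)       # fixed-size model on one patch
            acc[y:y + patch, x:x + patch] += out
            weight[y:y + patch, x:x + patch] += 1.0
    return acc / np.maximum(weight, 1.0)  # average where patches overlap
```

Averaging in the overlaps is what couples neighboring sub-scenes: repeated over all timesteps, it forces the patch trajectories to agree on shared regions.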
📝 Abstract
In this paper, we propose Extend3D, a training-free pipeline for 3D scene generation from a single image, built upon an object-centric 3D generative model. To overcome the limitations of fixed-size latent spaces in object-centric models for representing wide scenes, we extend the latent space in the $x$ and $y$ directions. Then, by dividing the extended latent space into overlapping patches, we apply the object-centric 3D generative model to each patch and couple them at each time step. Since patch-wise 3D generation with image conditioning requires strict spatial alignment between image and latent patches, we initialize the scene using a point cloud prior from a monocular depth estimator and iteratively refine occluded regions through SDEdit. We discovered that treating the incompleteness of 3D structure as noise during 3D refinement enables 3D completion through a process we term under-noising. Furthermore, to address the sub-optimality of object-centric models for sub-scene generation, we optimize the extended latent during denoising, ensuring that the denoising trajectories remain consistent with the sub-scene dynamics. To this end, we introduce 3D-aware optimization objectives for improved geometric structure and texture fidelity. We demonstrate that our method yields better results than prior methods, as evidenced by human preference and quantitative experiments.
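The under-noising idea in the abstract treats an incomplete 3D structure as if it were partially noised, so refinement amounts to SDEdit: re-noise only to an intermediate timestep $t_0 < T$ and denoise back. A sketch under that reading (the schedule `alpha_bar` and the `denoise_from` callback are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sdedit_refine(latent_init, denoise_from, t0, alpha_bar):
    """SDEdit-style refinement: forward-noise the initialized latent
    only to intermediate timestep t0 (not to pure noise), then run the
    reverse process from t0 so the prior's structure is preserved while
    occluded or incomplete regions get filled in."""
    noise = rng.standard_normal(latent_init.shape)
    # standard DDPM forward noising to timestep t0
    noisy = (np.sqrt(alpha_bar[t0]) * latent_init
             + np.sqrt(1.0 - alpha_bar[t0]) * noise)
    return denoise_from(noisy, t0)  # reverse diffusion starting at t0
```

Choosing $t_0$ trades fidelity to the depth-based initialization (small $t_0$) against the model's freedom to complete occluded geometry (large $t_0$).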
Problem

Research questions and friction points this paper is trying to address.

town-scale 3D generation
single-image 3D reconstruction
latent space extension
object-centric 3D generative model
3D scene completion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extend3D
object-centric 3D generation
latent space extension
under-noising
3D-aware optimization