🤖 AI Summary
Existing 3D scene generation methods rely on 2D multi-view or video diffusion models and therefore suffer from representation redundancy and insufficient spatial consistency. This work instead generates scenes directly in an implicit 3D latent space: a 3D Representation Autoencoder (3DRAE) maps multi-view 2D semantics into a compact, view-decoupled 3D representation, and a 3D Diffusion Transformer (3DDiT) performs diffusion modeling in that space for efficient, spatially consistent scene synthesis. Notably, it is the first approach to support rendering at arbitrary viewpoints, resolutions, and aspect ratios under fixed computational complexity: any camera trajectory can be decoded efficiently into images or point maps without a per-trajectory diffusion sampling pass. Experiments show that the generated scenes are significantly more spatially consistent than those of existing 2D-based methods.
📝 Abstract
3D scene generation has long been dominated by 2D multi-view or video diffusion models. This is due not only to the lack of a scene-level 3D latent representation, but also to the fact that most scene-level 3D visual data exists as multi-view images or videos, which are naturally compatible with 2D diffusion architectures. These 2D-based approaches typically reduce 3D spatial extrapolation to 2D temporal extension, which introduces two fundamental issues: (i) representing 3D scenes via 2D views leads to significant representation redundancy, and (ii) a latent space rooted in 2D inherently limits the spatial consistency of the generated 3D scenes. In this paper, we propose, for the first time, to perform 3D scene generation directly within an implicit 3D latent space to address these limitations. First, we repurpose frozen 2D representation encoders to construct our 3D Representation Autoencoder (3DRAE), which grounds view-coupled 2D semantic representations into a view-decoupled 3D latent representation. This enables representing 3D scenes observed from an arbitrary number of views--at any resolution and aspect ratio--with fixed complexity and rich semantics. We then introduce the 3D Diffusion Transformer (3DDiT), which performs diffusion modeling in this 3D latent space, achieving remarkably efficient and spatially consistent 3D scene generation while supporting diverse conditioning configurations. Moreover, since our approach directly generates a 3D scene representation, it can be decoded into images and, optionally, point maps along arbitrary camera trajectories without a per-trajectory diffusion sampling pass, which 2D-based approaches require.
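The core property claimed above--any number of input views pooled into a fixed-size, view-decoupled latent that can then be decoded for arbitrary query cameras--can be illustrated with a toy sketch. Everything here (`encode_view`, `aggregate_to_3d_latent`, `decode_view`, the slot count, and all shapes) is an illustrative assumption, not the paper's actual 3DRAE/3DDiT architecture:

```python
import numpy as np

# Toy sketch, NOT the paper's implementation: view-coupled 2D tokens from any
# number of views are pooled into a FIXED-size, view-decoupled 3D latent.
rng = np.random.default_rng(0)
PATCH_DIM, D, LATENT_SLOTS = 16, 32, 64

W_enc = rng.standard_normal((PATCH_DIM, D)) * 0.1   # stand-in frozen 2D encoder
slots = rng.standard_normal((LATENT_SLOTS, D))      # latent queries (fixed here)
W_dec = rng.standard_normal((D, PATCH_DIM)) * 0.1   # stand-in render/decode head

def encode_view(patches):
    """Frozen 2D encoder: (num_patches, PATCH_DIM) -> per-view semantic tokens."""
    return patches @ W_enc

def aggregate_to_3d_latent(view_tokens):
    """Cross-attention-style pooling of all views' tokens into LATENT_SLOTS
    slots; the latent size is fixed regardless of view count or resolution."""
    keys = np.concatenate(view_tokens, axis=0)            # (total_patches, D)
    logits = slots @ keys.T / np.sqrt(D)                  # (LATENT_SLOTS, total)
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ keys                                    # (LATENT_SLOTS, D)

def decode_view(latent, cam_embed, num_patches):
    """Decode the shared 3D latent for one query camera: the same latent is
    reused for every viewpoint, so no per-trajectory sampling is needed."""
    conditioned = latent + cam_embed                      # broadcast over slots
    # toy readout: every output patch reads the pooled latent uniformly
    patch_feats = np.tile(conditioned.mean(axis=0), (num_patches, 1))
    return patch_feats @ W_dec                            # (num_patches, PATCH_DIM)

# Two scenes observed with different view counts and resolutions:
few_views  = [rng.standard_normal((36, PATCH_DIM)) for _ in range(2)]   # 2 views
many_views = [rng.standard_normal((100, PATCH_DIM)) for _ in range(8)]  # 8 views

z_few  = aggregate_to_3d_latent([encode_view(v) for v in few_views])
z_many = aggregate_to_3d_latent([encode_view(v) for v in many_views])
print(z_few.shape, z_many.shape)   # (64, 32) (64, 32): fixed-size latent

img = decode_view(z_many, rng.standard_normal(D) * 0.1, num_patches=49)
print(img.shape)                   # (49, 16): arbitrary target resolution
```

In the paper this fixed-size latent is also what 3DDiT denoises, which is why the diffusion cost does not grow with the number of rendered views; here the diffusion step is omitted to keep the sketch minimal.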