Bolt3D: Generating 3D Scenes in Seconds

📅 2025-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses fast, high-fidelity 3D scene generation from one or more input images. The authors propose a feed-forward latent-space 3D generation approach built on 2D diffusion models. Methodologically: (1) they design a lightweight latent 3D representation that jointly models geometry and appearance; (2) they construct a large-scale multiview-consistent dataset of 3D geometry and appearance by applying state-of-the-art dense 3D reconstruction to existing multiview image datasets; and (3) they adapt pretrained, scalable 2D diffusion network architectures to directly sample this 3D representation. The resulting model, Bolt3D, generates multiview-consistent, photorealistic 3D scenes in under seven seconds on a single GPU, reducing inference cost by up to 300× compared to prior multiview generative models that require per-scene optimization.
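To make the high-level pipeline concrete, the core idea of feed-forward latent diffusion is to start from Gaussian noise and iteratively denoise a latent grid that encodes the scene, with no per-scene optimization. The sketch below is a minimal, illustrative denoising loop in NumPy; the `denoise_fn`, latent shape, and toy linear schedule are all assumptions for illustration, not Bolt3D's actual architecture or sampler.

```python
import numpy as np

def sample_latent_scene(denoise_fn, shape=(8, 16, 16), steps=50, seed=0):
    """Illustrative iterative-denoising loop (not the Bolt3D sampler).

    Starting from pure noise, each step blends the current latent with the
    network's prediction of the clean latent, so the scene representation
    emerges in a fixed number of feed-forward passes.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)      # start from Gaussian noise
    for t in reversed(range(steps)):
        x0_hat = denoise_fn(x, t)       # network's guess at the clean latent
        w = t / steps                   # toy schedule: noise fraction remaining
        x = w * x + (1 - w) * x0_hat    # step toward the predicted clean latent
    return x

# Placeholder "network" standing in for the adapted 2D diffusion model.
toy_denoiser = lambda x, t: 0.5 * x
scene_latent = sample_latent_scene(toy_denoiser)
```

Because the number of denoising steps is fixed, inference cost is constant per scene, which is what enables the seconds-scale generation the paper reports, in contrast to per-scene optimization whose cost grows with optimization iterations.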

📝 Abstract
We present a latent diffusion model for fast feed-forward 3D scene generation. Given one or more images, our model Bolt3D directly samples a 3D scene representation in less than seven seconds on a single GPU. We achieve this by leveraging powerful and scalable existing 2D diffusion network architectures to produce consistent high-fidelity 3D scene representations. To train this model, we create a large-scale multiview-consistent dataset of 3D geometry and appearance by applying state-of-the-art dense 3D reconstruction techniques to existing multiview image datasets. Compared to prior multiview generative models that require per-scene optimization for 3D reconstruction, Bolt3D reduces the inference cost by a factor of up to 300 times.
Problem

Research questions and friction points this paper is trying to address.

Fast 3D scene generation from images
Reducing inference cost for 3D reconstruction
Leveraging 2D diffusion networks for 3D scenes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent diffusion model for 3D scene generation
Single GPU generates 3D scenes in seconds
Leverages 2D diffusion networks for 3D consistency