Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation

📅 2024-11-21
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
Existing single-image 3D generation methods rely on 2D multi-view diffusion models, which suffer from poor 3D consistency and are restricted to object-centric inputs. This paper proposes DiffusionGS—the first single-stage, end-to-end 3D Gaussian splatting diffusion model—that directly synthesizes view-robust, geometrically consistent 3D Gaussian splatting representations from a single image, supporting both object- and scene-level reconstruction without depth estimation or explicit multi-view supervision. The authors embed Gaussian splatting parameters into the diffusion denoiser and introduce a view-conditioned modeling scheme alongside a scene-object hybrid training strategy. Quantitatively, DiffusionGS achieves +2.20 dB and +1.34 dB PSNR gains on object- and scene-level benchmarks, respectively, and reduces FID by 23.25 and 19.16 points. It attains an inference speed of about 6 seconds per sample on an A100 GPU—over five times faster than prior state-of-the-art methods.
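The core idea—having the diffusion denoiser emit a renderable set of 3D Gaussians at every timestep instead of a 2D image—can be illustrated with a minimal sketch. Everything below (latent size, Gaussian count, the random linear map standing in for the learned network, the conditioning scheme) is an illustrative assumption, not the paper's actual architecture.

```python
import numpy as np

N_GAUSSIANS = 1024
# Each Gaussian: 3 position + 3 scale + 4 rotation (quaternion) + 1 opacity + 3 color
PARAMS_PER_GAUSSIAN = 14
LATENT_DIM = 512

rng = np.random.default_rng(0)
# Stand-in for a learned denoiser: a fixed random projection (hypothetical).
W = rng.normal(size=(N_GAUSSIANS * PARAMS_PER_GAUSSIAN, LATENT_DIM)) * 0.01

def denoise_step(noisy_latent, t, view_dir):
    """Map a noisy latent, timestep, and view condition to Gaussian parameters.

    In a single-stage 3D diffusion model, this output is a renderable 3D
    Gaussian point cloud, so view consistency is enforced at every timestep.
    """
    # Toy view conditioning: concatenate timestep and view direction, then pad.
    cond = np.concatenate([noisy_latent, [t], view_dir])[:LATENT_DIM]
    cond = np.pad(cond, (0, LATENT_DIM - cond.size))
    out = W @ cond
    g = out.reshape(N_GAUSSIANS, PARAMS_PER_GAUSSIAN)
    positions = g[:, 0:3]
    scales = np.exp(g[:, 3:6])                     # keep scales positive
    rotations = g[:, 6:10]
    rotations /= np.linalg.norm(rotations, axis=1, keepdims=True)  # unit quaternions
    opacities = 1.0 / (1.0 + np.exp(-g[:, 10]))    # sigmoid to [0, 1]
    colors = g[:, 11:14]
    return positions, scales, rotations, opacities, colors

pos, sc, rot, op, col = denoise_step(
    rng.normal(size=500), t=0.5, view_dir=np.array([0.0, 0.0, 1.0])
)
```

The point of the sketch is the output type: because each timestep yields explicit Gaussian parameters, the intermediate states can be splatted from any camera, which is why the model does not collapse when the prompt view direction changes.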

📝 Abstract
Existing feedforward image-to-3D methods mainly rely on 2D multi-view diffusion models that cannot guarantee 3D consistency. These methods easily collapse when changing the prompt view direction and mainly handle object-centric cases. In this paper, we propose a novel single-stage 3D diffusion model, DiffusionGS, for object generation and scene reconstruction from a single view. DiffusionGS directly outputs 3D Gaussian point clouds at each timestep to enforce view consistency and allow the model to generate robustly given prompt views of any direction, beyond object-centric inputs. Plus, to improve the capability and generality of DiffusionGS, we scale up 3D training data by developing a scene-object mixed training strategy. Experiments show that DiffusionGS yields improvements of 2.20 dB/23.25 and 1.34 dB/19.16 in PSNR/FID for objects and scenes over the state-of-the-art methods, without a depth estimator. Plus, our method enjoys over 5× faster speed (~6s on an A100 GPU). Our project page at https://caiyuanhao1998.github.io/project/DiffusionGS/ shows the video and interactive results.
Problem

Research questions and friction points this paper is trying to address.

Ensuring 3D consistency in image-to-3D generation.
Handling non-object-centric inputs and arbitrary prompt view directions.
Improving speed and quality without relying on depth estimation.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Single-stage 3D diffusion model for image-to-3D
Direct 3D Gaussian point clouds output
Scene-object mixed training strategy