Flash3D: Feed-Forward Generalisable 3D Scene Reconstruction from a Single Image

📅 2024-06-06

🏛️ arXiv.org

📈 Citations: 52

✨ Influential: 9

🤖 AI Summary

This work addresses generalisable 3D scene reconstruction and novel view synthesis from a single image. It proposes Flash3D, an efficient feed-forward Gaussian Splatting framework built by extending a monocular depth foundation model into a full 3D shape and appearance reconstructor. A first layer of 3D Gaussians is placed at the predicted depth, and additional layers offset in space let the model reconstruct regions hidden by occlusion or cut off by truncation. The reported contributions are: (i) state-of-the-art results when trained and tested on RealEstate10k; (ii) strong cross-dataset transfer, outperforming competitors by a large margin on NYU and achieving better PSNR on KITTI than methods trained specifically on that dataset, in some instances even outperforming recent multi-view methods; and (iii) training on a single GPU in about a day.

📝 Abstract

We propose Flash3D, a method for scene reconstruction and novel view synthesis from a single image which is both very generalisable and efficient. For generalisability, we start from a "foundation" model for monocular depth estimation and extend it to a full 3D shape and appearance reconstructor. For efficiency, we base this extension on feed-forward Gaussian Splatting. Specifically, we predict a first layer of 3D Gaussians at the predicted depth, and then add additional layers of Gaussians that are offset in space, allowing the model to complete the reconstruction behind occlusions and truncations. Flash3D is very efficient, trainable on a single GPU in a day, and thus accessible to most researchers. It achieves state-of-the-art results when trained and tested on RealEstate10k. When transferred to unseen datasets like NYU it outperforms competitors by a large margin. More impressively, when transferred to KITTI, Flash3D achieves better PSNR than methods trained specifically on that dataset. In some instances, it even outperforms recent methods that use multiple views as input. Code, models, demo, and more results are available at https://www.robots.ox.ac.uk/~vgg/research/flash3d/.

Problem

Research questions and friction points this paper is trying to address.

Generalizable 3D scene reconstruction from a single image

Efficient feed-forward Gaussian Splatting for occlusion completion

State-of-the-art performance on unseen datasets like NYU and KITTI
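
The abstract reports its KITTI comparison in terms of PSNR. For readers unfamiliar with the metric, the following is a minimal sketch of the standard PSNR definition between a rendered view and the ground-truth image. The function name `psnr` and the assumption that images are float arrays in the range [0, 1] are illustrative choices, not taken from the paper's code.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape.

    Assumes pixel values lie in [0, max_val]; higher is better.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)  # mean squared error over all pixels
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# Usage: a uniform error of 0.1 gives MSE = 0.01.
target = np.zeros((4, 4, 3))
pred = np.full((4, 4, 3), 0.1)
score = psnr(pred, target)
```

Because PSNR is logarithmic in the mean squared error, a gain of a few dB corresponds to a substantial reduction in pixel error.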

Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends monocular depth to 3D reconstruction

Uses feed-forward Gaussian Splatting

Layers offset Gaussians for occlusion completion
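
The three ideas above can be illustrated with a small geometric sketch: lift a predicted depth map to 3D points, then place further layers of Gaussian centres behind the first. This is not the authors' implementation. It assumes a pinhole camera with intrinsics `K`, and it assumes each extra layer is pushed further along the camera ray by a non-negative per-pixel depth offset; the names `unproject` and `layered_means` are hypothetical. In the actual method the depth and offsets would come from a network, and each Gaussian would also carry opacity, covariance, and colour.

```python
import numpy as np

def unproject(depth, K):
    """Lift an (H, W) depth map to (H, W, 3) camera-space points."""
    H, W = depth.shape
    # Pixel-centre coordinates in homogeneous form.
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)
    rays = pix @ np.linalg.inv(K).T  # rays normalised so that z = 1
    return rays * depth[..., None]

def layered_means(depth, depth_offsets, K):
    """Gaussian centres for L layers, shape (L, H, W, 3).

    depth:         (H, W) first-layer depth (e.g. from a monocular depth model)
    depth_offsets: (L-1, H, W) non-negative offsets (assumed, see lead-in)
    """
    layer_depths = [depth]
    for delta in depth_offsets:
        # Each layer sits at or behind the previous one along the same ray.
        layer_depths.append(layer_depths[-1] + delta)
    return np.stack([unproject(d, K) for d in layer_depths])

# Usage with toy values: a flat wall 2 m away plus one layer 0.5 m behind it.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 24.0],
              [0.0, 0.0, 1.0]])
depth = np.full((48, 64), 2.0)
offsets = np.full((1, 48, 64), 0.5)
means = layered_means(depth, offsets, K)
```

Keeping the extra layers on the same rays but at greater depth is one simple way to give the model Gaussians it can use for surfaces hidden behind what the first layer sees.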

🔎 Similar Papers

Generalizable 3D Scene Reconstruction via Divide and Conquer from a Single View