Fillerbuster: Multi-View Scene Completion for Casual Captures

📅 2025-02-07



🤖 AI Summary

Problem: Casual multi-view captures of everyday scenes are often sparse and miss content behind objects or above the scene, which leaves 3D reconstructions incomplete. The images also frequently lack known camera parameters. Method: Fillerbuster is a unified generative framework built on a large-scale multi-view latent diffusion transformer. It consumes a large context of input frames, generates unknown target views, and recovers image poses when desired. Contribution/Results: The authors describe it as the first model to predict many images and poses together for scene completion. They show completion of partial captures on two existing datasets and introduce an uncalibrated scene completion task in which the unified model both predicts poses and creates new content.


📝 Abstract

We present Fillerbuster, a method that completes unknown regions of a 3D scene by utilizing a novel large-scale multi-view latent diffusion transformer. Casual captures are often sparse and miss surrounding content behind objects or above the scene. Existing methods are not suitable for handling this challenge as they focus on making the known pixels look good with sparse-view priors, or on creating the missing sides of objects from just one or two photos. In reality, we often have hundreds of input frames and want to complete areas that are missing and unobserved from the input frames. Additionally, the images often do not have known camera parameters. Our solution is to train a generative model that can consume a large context of input frames while generating unknown target views and recovering image poses when desired. We show results where we complete partial captures on two existing datasets. We also present an uncalibrated scene completion task where our unified model predicts both poses and creates new content. Our model is the first to predict many images and poses together for scene completion.

Problem

Research questions and friction points this paper is trying to address.

Completes unknown 3D scene regions

Handles sparse casual captures effectively

Predicts new content and image poses

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-view latent diffusion transformer

Generative model for scene completion

Unified model predicts both new views and image poses
