🤖 AI Summary
This work addresses the challenge of reconstructing complete and accurate 3D scenes from extremely sparse inputs—such as a single image—where Gaussian Splatting struggles, particularly in occluded or unobserved regions. The authors propose a training-free generative reconstruction framework that, for the first time, integrates video diffusion models into the Gaussian Splatting pipeline without requiring additional training. Their approach employs a geometry-guided, multi-stage denoising strategy coupled with a confidence-based iterative view completion mechanism to synthesize geometrically consistent novel views that fill in missing scene content. By further incorporating adaptive camera trajectory sampling and confidence-weighted optimization, the method achieves state-of-the-art performance across multiple benchmarks, enabling robust, high-fidelity, and complete 3D reconstructions from highly sparse observations.
📝 Abstract
Gaussian Splatting has achieved remarkable progress in multi-view surface reconstruction, yet it exhibits notable degradation when only few views are available. Although recent efforts alleviate this issue by enhancing multi-view consistency to produce plausible surfaces, they struggle to infer unseen, occluded, or weakly constrained regions beyond the input coverage. To address this limitation, we present VidSplat, a training-free generative reconstruction framework that leverages powerful video diffusion priors to iteratively synthesize novel views that compensate for missing input coverage, and thereby recover complete 3D scenes from sparse inputs. Specifically, we tackle two key challenges that enable the effective integration of generation and reconstruction. First, for 3D consistent generation, we elaborate a training-free, stage-wise denoising strategy that adaptively guides the denoising direction toward the underlying geometry using the rendered RGB and mask images. Second, to enhance the reconstruction, we develop an iterative mechanism that samples camera trajectories, explores unobserved regions, synthesizes novel views, and supplements training through confidence weighted refinement. VidSplat performs robustly to sparse input and even a single image. Extensive experiments on widely used benchmarks demonstrate our superior performance in sparse-view scene reconstruction.