Taming Video Diffusion Prior with Scene-Grounding Guidance for 3D Gaussian Splatting from Sparse Inputs

📅 2025-03-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address degraded novel-view synthesis in 3D Gaussian Splatting (3DGS) under sparse input views—caused by inaccurate extrapolation beyond the observed field of view and missing occlusion reasoning—this paper proposes a reconstruction-by-generation framework. First, it introduces a training-free, fine-tuning-free scene-grounding guidance mechanism: sequences rendered from an optimized 3DGS serve as geometric anchors that steer a frozen pre-trained video diffusion model toward semantically coherent completion of occluded and unobserved regions. Second, it designs a camera-trajectory initialization method and an optimization scheme tailored to sparse-view 3DGS refinement, improving geometric robustness. The method requires no additional supervision or model adaptation, relying solely on frozen pre-trained video diffusion priors and geometric reasoning. Evaluated on multiple challenging sparse-view benchmarks, it significantly outperforms existing 3DGS baselines, achieving state-of-the-art reconstruction quality and generalization.

📝 Abstract
Despite recent successes in novel view synthesis using 3D Gaussian Splatting (3DGS), modeling scenes with sparse inputs remains a challenge. In this work, we address two critical yet overlooked issues in real-world sparse-input modeling: extrapolation and occlusion. To tackle these issues, we propose a reconstruction-by-generation pipeline that leverages learned priors from video diffusion models to provide plausible interpretations for regions that are outside the field of view or occluded. However, the generated sequences exhibit inconsistencies that do not fully benefit subsequent 3DGS modeling. To address this challenge, we introduce a novel scene-grounding guidance based on rendered sequences from an optimized 3DGS, which tames the diffusion model to generate consistent sequences. This guidance is training-free and does not require any fine-tuning of the diffusion model. To facilitate holistic scene modeling, we also propose a trajectory initialization method that effectively identifies regions outside the field of view and occluded regions. We further design an optimization scheme tailored for 3DGS with generated sequences. Experiments demonstrate that our method significantly improves upon the baseline and achieves state-of-the-art performance on challenging benchmarks.
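The abstract does not spell out how the rendered 3DGS sequences "tame" the diffusion sampler, but a training-free guidance of this kind is typically applied at each denoising step by blending the model's clean-frame estimate toward the scene-grounded render. The sketch below is illustrative only: the DDIM-style update, the linear blending rule, and all names (`scene_grounding_step`, `guidance_scale`) are assumptions, not the paper's actual formulation.

```python
import numpy as np

def scene_grounding_step(x_t, eps_pred, render, alpha_bar_t, guidance_scale=0.3):
    """One guided denoising step (hypothetical sketch, not the paper's exact rule).

    x_t         : noisy latent frame at step t, shape (H, W, C)
    eps_pred    : noise predicted by the frozen video diffusion model
    render      : the corresponding frame rendered from the optimized 3DGS
    alpha_bar_t : cumulative noise-schedule coefficient at step t
    """
    # DDIM-style estimate of the clean frame from the current noisy latent
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    # Training-free guidance: pull the estimate toward the scene-grounded render,
    # so no fine-tuning of the diffusion model is needed
    return (1.0 - guidance_scale) * x0_pred + guidance_scale * render
```

Because the correction touches only the sampling loop, the diffusion weights stay frozen, matching the paper's claim that the guidance requires no training or fine-tuning.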
Problem

Research questions and friction points this paper is trying to address.

Addressing extrapolation and occlusion in sparse-input 3D scene modeling.
Using video diffusion priors to generate plausible interpretations for unseen regions.
Introducing scene-grounding guidance to ensure consistency in generated sequences.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages video diffusion models for scene extrapolation
Introduces scene-grounding guidance for consistent sequence generation
Proposes a trajectory initialization method for identifying out-of-view and occluded regions
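The trajectory initialization above can be pictured as densifying a camera path through the sparse input poses and pushing it slightly past both endpoints, so that rendered views expose regions outside the original field of view. This is a minimal sketch under that assumption; the paper's actual method, and all names and parameters here (`extrapolate`, `n_views`), are not taken from the source.

```python
import numpy as np

def lerp_path(cams, t):
    """Piecewise-linear point on the camera path at parameter t in [0, 1].

    Values of t outside [0, 1] extrapolate past the first/last input camera
    along the adjacent segment's direction.
    """
    n = len(cams) - 1
    s = t * n
    i = int(np.clip(np.floor(s), 0, n - 1))  # segment index
    w = s - i                                # fractional position in segment
    return cams[i] + w * (cams[i + 1] - cams[i])

def init_trajectory(cam_positions, n_views=8, extrapolate=0.2):
    """Densify a sparse camera path into n_views positions, extending
    slightly beyond both ends to cover out-of-view regions (illustrative)."""
    cams = np.asarray(cam_positions, dtype=float)
    ts = np.linspace(-extrapolate, 1.0 + extrapolate, n_views)
    return np.stack([lerp_path(cams, t) for t in ts])
```

A real implementation would also interpolate rotations (e.g. with quaternion slerp) and could flag which sampled views fall outside the union of the input frusta; only positions are handled here for brevity.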