🤖 AI Summary
This work addresses novel-view video synthesis from extremely sparse input views (e.g., only a few images). It formulates the task as natural video completion, jointly leveraging a pretrained video diffusion model, which generates temporally and spatially coherent intermediate frames, and a 3D Gaussian Splatting representation for geometry-aware scene reconstruction. An uncertainty-aware mechanism establishes an iterative feedback loop between 3D geometry estimation and 2D rendering, promoting both spatial consistency and rendering fidelity. Crucially, the framework operates in a zero-shot, test-time optimization setting and requires no scene-specific training. Extensive experiments on the LLFF, DTU, DL3DV, and MipNeRF-360 benchmarks demonstrate substantial improvements over 3D Gaussian Splatting (3D-GS) baselines, particularly under extreme sparsity. The method achieves high-fidelity, spatiotemporally coherent novel-view videos and shows strong robustness and generalization without any per-scene adaptation.
📝 Abstract
Given just a few glimpses of a scene, can you imagine the movie playing out as the camera glides through it? That's the lens we take on *sparse-input novel view synthesis*: treating it not only as filling spatial gaps between widely spaced views, but also as *completing a natural video* unfolding through space.
We recast the task as *test-time natural video completion*, using powerful priors from *pretrained video diffusion models* to hallucinate plausible in-between views. Our *zero-shot, generation-guided* framework produces pseudo views at novel camera poses, modulated by an *uncertainty-aware mechanism* for spatial coherence. These synthesized frames densify supervision for *3D Gaussian Splatting* (3D-GS) scene reconstruction, especially in under-observed regions. An iterative feedback loop lets 3D geometry and 2D view synthesis inform each other, improving both the scene reconstruction and the generated views.
The result is coherent, high-fidelity renderings from sparse inputs *without any scene-specific training or fine-tuning*. On LLFF, DTU, DL3DV, and MipNeRF-360, our method significantly outperforms strong 3D-GS baselines under extreme sparsity.
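To make the feedback loop concrete, here is a minimal Python sketch of how the alternation between diffusion-based pseudo-view generation and 3D-GS fitting could be organized. It is an illustration under our own assumptions, not the authors' implementation; the callables `render`, `complete_video`, `estimate_uncertainty`, and `fit_gaussians` are hypothetical placeholders for the 3D-GS renderer, the pretrained video diffusion completion step, the uncertainty estimator, and the Gaussian optimization, respectively.

```python
# Illustrative sketch only (not the authors' code): the four callables passed in are
# hypothetical placeholders standing in for components described in the abstract.

def generation_guided_loop(gaussians, sparse_views, novel_poses,
                           render, complete_video, estimate_uncertainty, fit_gaussians,
                           num_rounds=3):
    """Alternate between diffusion-based pseudo-view generation and 3D-GS optimization."""
    for _ in range(num_rounds):
        # 1. Render the current 3D-GS reconstruction along the novel camera trajectory.
        renders = [render(gaussians, pose) for pose in novel_poses]

        # 2. Treat the sparse inputs plus rough renders as a partially observed video
        #    and let the pretrained video diffusion prior complete the in-between frames.
        pseudo_views = complete_video(sparse_views, renders, novel_poses)

        # 3. Down-weight generated content where the prior is uncertain, so hallucinated
        #    detail cannot override regions that are well constrained by real views.
        weights = [estimate_uncertainty(p, r) for p, r in zip(pseudo_views, renders)]

        # 4. Densify supervision: refit the Gaussians to the real sparse views plus the
        #    uncertainty-weighted pseudo views, then repeat with the improved geometry.
        gaussians = fit_gaussians(gaussians, sparse_views, pseudo_views, weights)

    return gaussians
```

In this form the loop runs entirely at test time, matching the zero-shot setting described above: no component is trained per scene, and only the Gaussian parameters are optimized.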