Novel View Synthesis from A Few Glimpses via Test-Time Natural Video Completion

📅 2025-11-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses novel-view video synthesis from extremely sparse inputs (e.g., only a few views of a scene). Methodologically, it formulates the task as natural video completion, jointly leveraging a pretrained video diffusion model, which generates temporally and spatially coherent intermediate frames, and a 3D Gaussian Splatting representation for geometry-aware scene reconstruction. An uncertainty-aware mechanism establishes an iterative feedback loop between 3D geometry estimation and 2D rendering, enforcing both spatial consistency and rendering fidelity. Crucially, the framework operates in a zero-shot, test-time optimization setting, requiring no scene-specific training. Extensive experiments on the LLFF, DTU, DL3DV, and MipNeRF-360 benchmarks demonstrate substantial improvements over 3D Gaussian Splatting (3D-GS) baselines, particularly under extreme sparsity. The method achieves high-fidelity, spatiotemporally coherent novel-view videos while exhibiting superior robustness and generalization without any per-scene adaptation.
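To make the described loop concrete, here is a minimal sketch of the test-time alternation between video-diffusion completion and 3D-GS optimization. All function names (`diffusion_complete`, `fit_gaussians`, `render_gaussians`) are hypothetical stand-ins for the paper's components, and the exponential uncertainty weighting is one plausible reading of the summary, not the authors' stated formulation.

```python
import numpy as np

H, W = 64, 64  # toy resolution for the sketch

# Hypothetical stubs standing in for the real components (not the authors' code).
def diffusion_complete(cond_frames, target_poses):
    """Pretrained video diffusion prior: hallucinate frames at novel poses,
    conditioned on the known (or previously rendered) frames."""
    return [np.random.rand(H, W, 3) for _ in target_poses]

def fit_gaussians(views, poses, weights):
    """Optimize a 3D-GS scene against (possibly pseudo) views, with per-view
    weights down-weighting uncertain supervision."""
    return {"views": views, "poses": poses, "weights": weights}  # placeholder scene

def render_gaussians(scene, pose):
    """Render the 3D-GS scene from a camera pose."""
    return np.random.rand(H, W, 3)

def test_time_completion(input_views, input_poses, novel_poses, n_rounds=3):
    # Round 0: hallucinate pseudo views directly from the sparse inputs.
    pseudo = diffusion_complete(input_views, novel_poses)
    weights = [1.0] * len(novel_poses)
    for _ in range(n_rounds):
        # 3D step: densified supervision = real views (weight 1) + pseudo views.
        scene = fit_gaussians(
            input_views + pseudo,
            input_poses + novel_poses,
            [1.0] * len(input_views) + weights,
        )
        # 2D step: re-render the novel poses from the current geometry.
        renders = [render_gaussians(scene, p) for p in novel_poses]
        # Uncertainty: disagreement between generation and rendering;
        # high disagreement -> low weight on that pseudo view next round.
        errors = [np.mean(np.abs(g - r)) for g, r in zip(pseudo, renders)]
        weights = [float(np.exp(-e / 0.1)) for e in errors]
        # Feed the geometry-consistent renders back into the diffusion prior.
        pseudo = diffusion_complete(input_views + renders, novel_poses)
    return scene
```

Note that nothing in this loop is trained per scene: both the diffusion prior and the feedback procedure are applied purely at test time, which is what the zero-shot claim refers to.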

📝 Abstract
Given just a few glimpses of a scene, can you imagine the movie playing out as the camera glides through it? That's the lens we take on sparse-input novel view synthesis: not only as filling spatial gaps between widely spaced views, but also as completing a natural video unfolding through space. We recast the task as test-time natural video completion, using powerful priors from pretrained video diffusion models to hallucinate plausible in-between views. Our zero-shot, generation-guided framework produces pseudo views at novel camera poses, modulated by an uncertainty-aware mechanism for spatial coherence. These synthesized frames densify supervision for 3D Gaussian Splatting (3D-GS) scene reconstruction, especially in under-observed regions. An iterative feedback loop lets 3D geometry and 2D view synthesis inform each other, improving both the scene reconstruction and the generated views. The result is coherent, high-fidelity renderings from sparse inputs without any scene-specific training or fine-tuning. On LLFF, DTU, DL3DV, and MipNeRF-360, our method significantly outperforms strong 3D-GS baselines under extreme sparsity.
Problem

Research questions and friction points this paper is trying to address.

Synthesizing novel views from sparse input images
Completing natural videos between widely spaced camera poses
Reconstructing 3D scenes without scene-specific training data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages pretrained video diffusion models for completion
Uses an uncertainty-aware mechanism for spatial coherence (see the sketch after this list)
Integrates 3D Gaussian Splatting with iterative feedback loop
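As a concrete illustration of the uncertainty-aware weighting mentioned above, the snippet below sketches a per-pixel weighted photometric loss between a 3D-GS render and a diffusion-generated pseudo view. The exponential weighting and the `tau` temperature are illustrative assumptions, not the paper's exact formula.

```python
import numpy as np

def uncertainty_weighted_loss(rendered, pseudo, tau=0.1):
    """Weighted L1 loss between a 3D-GS render and a diffusion-generated
    pseudo view. Pixels where the two disagree strongly are treated as
    uncertain and contribute less to the reconstruction objective.
    (Illustrative assumption: exp(-error / tau) confidence weighting.)"""
    err = np.abs(rendered - pseudo).mean(axis=-1)   # per-pixel L1 error
    w = np.exp(-err / tau)                          # confidence in [0, 1]
    return float((w * err).mean())

# Usage: regions the prior likely hallucinated wrong contribute little loss.
render = np.random.rand(64, 64, 3)
pseudo = np.random.rand(64, 64, 3)
print(uncertainty_weighted_loss(render, pseudo))
```

The design intuition is that pseudo views are only trustworthy where they agree with the geometry-consistent render, so supervision is concentrated there rather than applied uniformly.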