🤖 AI Summary
To address geometric distortion and poor generalization in monocular dynamic-video novel-view synthesis, this paper proposes CogNVS, a two-stage framework that decouples the task into geometry-driven dynamic 3D reconstruction (with differentiable rendering) and generative-prior-guided 2D video completion. The authors introduce a self-supervised video-inpainting diffusion model that can be trained on unlabeled in-the-wild videos without manual annotations, and add lightweight test-time finetuning to improve cross-scene generalization. Crucially, the method avoids costly optimization of 4D representations and mitigates the geometric collapse seen in feed-forward training. Across multiple dynamic-scene benchmarks, CogNVS outperforms almost all prior approaches, producing novel views with high geometric fidelity, strong temporal consistency, and good detail recovery, establishing an efficient and scalable paradigm for monocular dynamic novel-view synthesis.
📝 Abstract
We explore novel-view synthesis for dynamic scenes from monocular videos. Prior approaches rely on costly test-time optimization of 4D representations or fail to preserve scene geometry when trained in a feed-forward manner. Our approach is based on three key insights: (1) covisible pixels (those visible in both the input and target views) can be rendered by first reconstructing the dynamic 3D scene and then rendering the reconstruction from the novel views, and (2) hidden pixels in novel views can be "inpainted" with feed-forward 2D video diffusion models. Notably, our video-inpainting diffusion model (CogNVS) can be self-supervised from 2D videos, allowing us to train it on a large corpus of in-the-wild videos. This in turn allows (3) CogNVS to be applied zero-shot to novel test videos via test-time finetuning. We empirically verify that CogNVS outperforms almost all prior art for novel-view synthesis of dynamic scenes from monocular videos.
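The self-supervision idea above can be illustrated with a minimal sketch: from an ordinary 2D video, construct a (masked input, full video) training pair, where the mask stands in for the covisible/hidden split that a novel camera path would induce. This is only a toy illustration with a random hole shared across frames; in the actual method the holes come from reprojecting the reconstructed dynamic 3D scene, and the function name and shapes here are assumptions, not the paper's API.

```python
import numpy as np

def make_selfsup_pair(video, hole_frac=0.3, seed=0):
    """Build a self-supervised inpainting pair from a plain 2D video.

    video: float array of shape (T, H, W, C).
    Returns (masked_input, mask, target); a video-inpainting model is
    trained to map masked_input -> target.

    NOTE: the random spatial hole below is a stand-in. In CogNVS the
    hidden region is derived from 3D covisibility under a novel view,
    not sampled at random.
    """
    rng = np.random.default_rng(seed)
    T, H, W, C = video.shape
    # One shared spatial hole per clip loosely mimics a disocclusion
    # that persists across frames under a fixed novel camera path.
    mask = (rng.random((H, W)) > hole_frac).astype(video.dtype)  # 1 = covisible
    mask = np.broadcast_to(mask[None, :, :, None], video.shape)
    masked_input = video * mask  # hidden pixels zeroed out
    return masked_input, mask, video
```

Because the target is just the original video, any unlabeled in-the-wild clip yields a training pair, which is what lets the inpainting model scale without annotations.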