🤖 AI Summary
This work addresses the problem of novel view synthesis from only about five sparse, posed multi-view images by proposing FrameCrafter, a framework that reframes the task as low-frame-rate video completion. By removing temporal positional encodings and introducing per-frame latent codes, the method endows video diffusion models with permutation invariance over the input view set, enabling compatibility with unordered sparse inputs. Without relying on explicit 3D modeling, FrameCrafter leverages the implicit multi-view consistency prior inherent in video diffusion models, augmented with tailored architectural modifications and fine-tuning strategies. Evaluated on standard benchmarks, the approach achieves competitive performance, demonstrating that video diffusion models can effectively “forget” temporal dynamics with minimal supervision and successfully adapt to novel view synthesis.
📄 Abstract
We tackle the problem of sparse novel view synthesis (NVS) using video diffusion models: given $K$ ($\approx 5$) multi-view images of a scene and their camera poses, we predict the view from a target camera pose. Many prior approaches leverage generative image priors encoded via diffusion models. However, models trained on single images lack multi-view knowledge. We instead argue that video models already contain implicit multi-view knowledge and so should be easier to adapt for NVS. Our key insight is to formulate sparse NVS as a low-frame-rate video completion task. However, one challenge is that sparse NVS is defined over an unordered set of inputs, often too sparse to admit a meaningful order, so the models should be $\textit{invariant}$ to permutations of that input set. To this end, we present FrameCrafter, which adapts video models (naturally trained with coherent frame orderings) to permutation-invariant NVS through several architectural modifications, including per-frame latent encodings and removal of temporal positional embeddings. Our results suggest that video models can be easily trained to "forget" about time with minimal supervision, producing competitive performance on sparse-view NVS benchmarks. Project page: https://frame-crafter.github.io/
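The role of removing temporal positional embeddings can be illustrated with a minimal sketch (not the authors' implementation; all names and dimensions here are illustrative): self-attention that carries no positional information is permutation-equivariant over its tokens, so permuting the $K$ input views simply permutes the outputs the same way. Adding a temporal positional embedding would break this property, which is why the abstract's architectural change matters for an unordered input set.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head self-attention with NO positional embeddings.

    Nothing here encodes token order, so the map is permutation-
    equivariant: permuting the rows of x (one token per input view)
    permutes the output rows identically.
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = (q @ k.T) / np.sqrt(k.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)   # row-wise softmax over views
    return w @ v

rng = np.random.default_rng(0)
K, d = 5, 8                               # K ~ 5 sparse views, toy feature dim
x = rng.normal(size=(K, d))               # one latent token per input view
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

out = self_attention(x, Wq, Wk, Wv)
perm = rng.permutation(K)
out_perm = self_attention(x[perm], Wq, Wk, Wv)
print(np.allclose(out[perm], out_perm))   # → True: equivariant to view order
```

Per-frame latent codes (the paper's other modification) can then distinguish views by their camera poses rather than by an arbitrary temporal index, keeping the set-valued input semantics intact.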