🤖 AI Summary
Existing novel view synthesis methods struggle with large viewpoint variations and temporally coherent generation, often relying on task-specific configurations or explicit 3D priors (e.g., NeRF, 3D Gaussian Splatting). This paper introduces the first end-to-end diffusion model for general-purpose novel view synthesis, supporting arbitrary numbers of input views and arbitrary target camera poses without requiring 3D reconstruction. Our core innovation is a novel diffusion architecture designed to jointly ensure generalizability and temporal consistency, integrated with an optimized training strategy and flexible sampling mechanism. Evaluated across multiple benchmarks, our method significantly outperforms state-of-the-art approaches. It enables plug-and-play generation of high-fidelity, naturally looping videos up to 30 seconds long, without post-processing or knowledge distillation.
📝 Abstract
We present Stable Virtual Camera (Seva), a generalist diffusion model that creates novel views of a scene, given any number of input views and target cameras. Existing works struggle to handle large viewpoint changes or to generate temporally smooth samples, and typically rely on specific task configurations. Our approach overcomes these limitations through a simple model design, an optimized training recipe, and a flexible sampling strategy that generalize across view synthesis tasks at test time. As a result, our samples maintain high consistency without requiring additional 3D representation-based distillation, thus streamlining view synthesis in the wild. Furthermore, we show that our method can generate high-quality videos lasting up to half a minute with seamless loop closure. Extensive benchmarking demonstrates that Seva outperforms existing methods across different datasets and settings.
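
To make the input/output contract described above concrete, here is a minimal illustrative sketch in Python. It is not the released Seva code or API: the `View` dataclass and `synthesize_novel_views` function are hypothetical names, and the body is a stub standing in for the diffusion sampler, showing only how any number of posed input views map to frames at arbitrary target poses.

```python
from dataclasses import dataclass
import numpy as np


@dataclass
class View:
    """An image paired with its camera pose (hypothetical structure for illustration)."""
    image: np.ndarray       # (H, W, 3) RGB
    extrinsics: np.ndarray  # (4, 4) world-to-camera transform
    intrinsics: np.ndarray  # (3, 3) pinhole intrinsics


def synthesize_novel_views(input_views: list[View],
                           target_cameras: list[tuple[np.ndarray, np.ndarray]],
                           num_denoising_steps: int = 50) -> list[np.ndarray]:
    """Stub for a generalist NVS diffusion model: one frame per target camera,
    conditioned jointly on all input views, so the same call covers single-view,
    sparse-view, and dense trajectory settings. A real model would run the
    diffusion sampler here; this placeholder returns blank frames."""
    h, w, _ = input_views[0].image.shape
    return [np.zeros((h, w, 3), dtype=np.float32) for _ in target_cameras]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Any number of input views ...
    inputs = [View(image=rng.random((576, 576, 3), dtype=np.float32),
                   extrinsics=np.eye(4), intrinsics=np.eye(3))
              for _ in range(2)]
    # ... and any number of target camera poses (extrinsics, intrinsics).
    targets = [(np.eye(4), np.eye(3)) for _ in range(8)]
    frames = synthesize_novel_views(inputs, targets)
    print(len(frames), frames[0].shape)  # 8 (576, 576, 3)
```

The point of the sketch is only the interface: no per-task configuration or 3D reconstruction step appears between the posed inputs and the requested target views.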