SV4D: Dynamic 3D Content Generation with Multi-Frame and Multi-View Consistency

📅 2024-07-24
🏛️ arXiv.org
📈 Citations: 15
Influential: 0
🤖 AI Summary
This work addresses the challenge of jointly modeling temporal coherence across frames and multi-view consistency when generating dynamic 3D content from a monocular video. The authors propose an end-to-end latent video diffusion framework that directly synthesizes temporally consistent novel-view videos in latent space, which are then used to reconstruct a high-fidelity 4D representation (dynamic NeRF). Departing from conventional score distillation sampling (SDS) optimization, the method integrates explicit temporal modeling with multi-view geometric constraints. Key components include a time-consistency regularization loss and a dynamic 3D training dataset curated from Objaverse. The approach achieves state-of-the-art performance on multiple benchmarks for both novel-view video synthesis and 4D dynamic scene generation, and user studies confirm significant improvements in visual realism and spatiotemporal coherence over existing methods.

📝 Abstract
We present Stable Video 4D (SV4D), a latent video diffusion model for multi-frame and multi-view consistent dynamic 3D content generation. Unlike previous methods that rely on separately trained generative models for video generation and novel view synthesis, we design a unified diffusion model to generate novel view videos of dynamic 3D objects. Specifically, given a monocular reference video, SV4D generates novel views for each video frame that are temporally consistent. We then use the generated novel view videos to optimize an implicit 4D representation (dynamic NeRF) efficiently, without the need for cumbersome SDS-based optimization used in most prior works. To train our unified novel view video generation model, we curate a dynamic 3D object dataset from the existing Objaverse dataset. Extensive experimental results on multiple datasets and user studies demonstrate SV4D's state-of-the-art performance on novel-view video synthesis as well as 4D generation compared to prior works.
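The abstract describes a two-stage pipeline: a unified diffusion model first generates temporally consistent novel-view videos from the monocular reference video, and those videos then supervise an efficient dynamic-NeRF fit, with no per-scene SDS optimization. A minimal sketch of that control flow is below; all function names, view labels, and the placeholder bodies are illustrative assumptions, not the authors' actual API or model.

```python
# Hypothetical sketch of the SV4D two-stage pipeline described in the
# abstract. Function names and view labels are illustrative only.

def generate_novel_view_videos(reference_video, camera_views):
    """Stage 1 (sketch): the unified video diffusion model would map the
    monocular reference video to one temporally consistent video per
    requested camera view. Here a placeholder just tags each frame
    with its view label."""
    return {
        view: [f"frame{t}@{view}" for t in range(len(reference_video))]
        for view in camera_views
    }

def optimize_dynamic_nerf(novel_view_videos, steps=100):
    """Stage 2 (sketch): fit an implicit 4D representation (dynamic NeRF)
    directly to the generated videos, e.g. by photometric loss, instead
    of cumbersome SDS-based optimization. Placeholder returns a summary
    of what such a fit would consume."""
    n_views = len(novel_view_videos)
    n_frames = len(next(iter(novel_view_videos.values())))
    return {"views": n_views, "frames": n_frames, "steps": steps}

reference = [f"ref_frame{t}" for t in range(21)]   # monocular input video
views = ["az030", "az120", "az210", "az300"]       # hypothetical camera orbit

videos = generate_novel_view_videos(reference, views)
nerf = optimize_dynamic_nerf(videos)
# nerf == {"views": 4, "frames": 21, "steps": 100}
```

The key design point the abstract highlights is that stage 2 is plain fitting to generated frames, so its cost scales with ordinary reconstruction rather than with SDS-style distillation through the diffusion model.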
Problem

Research questions and friction points this paper is trying to address.

Generating novel-view videos of dynamic 3D objects from a single monocular video.
Ensuring temporal consistency across the frames of the generated videos.
Optimizing an implicit 4D representation efficiently, without cumbersome SDS-based optimization.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified diffusion model for novel-view video generation
Efficient 4D representation (dynamic NeRF) optimization without SDS
Dynamic 3D object dataset curated from Objaverse