4DVD: Cascaded Dense-view Video Diffusion Model for High-quality 4D Content Generation

📅 2025-08-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Problem: Generating high-dimensional 4D content (e.g., dense-view videos) suffers from poor cross-view-temporal consistency and from the difficulty of optimizing explicit 4D representations (e.g., 4D Gaussians). Method: We propose a cascaded two-stage diffusion framework: Stage I generates coarse multi-view layouts; Stage II refines spatiotemporal details via a structure-aware conditional network that fuses the appearance of the input monocular video with geometric priors. Contribution/Results: We introduce cross-view temporal attention and a unified structural guidance strategy, enabling accurate, disentangled optimization of explicit 4D representations. Trained on our newly constructed dynamic 3D object dataset, D-Objaverse, our model achieves state-of-the-art performance on novel-view synthesis and 4D video generation, supporting high-fidelity outputs of 21 frames × 16 views with significantly improved spatiotemporal consistency and joint geometry-appearance fidelity.

📝 Abstract
Given the high complexity of directly generating high-dimensional data such as 4D content, we present 4DVD, a cascaded video diffusion model that generates 4D content in a decoupled manner. Unlike previous multi-view video methods that model 3D space and temporal features simultaneously with stacked cross-view/temporal attention modules, 4DVD decouples the task into two subtasks, coarse multi-view layout generation and structure-aware conditional generation, and effectively unifies them. Specifically, given a monocular video, 4DVD first predicts a dense-view layout of its content with superior cross-view and temporal consistency. Based on the produced layout priors, a structure-aware spatio-temporal generation branch combines these coarse structural priors with the detailed appearance of the input monocular video to generate the final high-quality dense-view videos. Benefiting from this, explicit 4D representations (such as 4D Gaussians) can be optimized accurately, enabling wider practical application. To train 4DVD, we collect a dynamic 3D object dataset, called D-Objaverse, from the Objaverse benchmark and render 16 videos of 21 frames each for every object. Extensive experiments demonstrate our state-of-the-art performance on both novel view synthesis and 4D generation. Our project page is https://4dvd.github.io/
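The abstract contrasts 4DVD with methods that stack cross-view/temporal attention modules. As a rough illustration of what joint cross-view temporal attention computes, here is a minimal NumPy sketch: for each spatial token, the view and time axes are merged into one sequence so every token attends to all views at all timesteps. The tensor shapes, projection matrices, and single-head form are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_view_temporal_attention(x, wq, wk, wv):
    """Attend jointly across views and frames for each spatial token.

    x: features of shape (V, T, N, D) -- views, frames, spatial tokens, channels.
    Merging the V and T axes into one sequence lets every token see all views
    at all timesteps, which is what enforces cross-view-temporal consistency
    in this sketch.
    """
    V, T, N, D = x.shape
    seq = x.transpose(2, 0, 1, 3).reshape(N, V * T, D)   # (N, V*T, D)
    q, k, v = seq @ wq, seq @ wk, seq @ wv
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(D), axis=-1)
    out = attn @ v                                       # (N, V*T, D)
    return out.reshape(N, V, T, D).transpose(1, 2, 0, 3)  # back to (V, T, N, D)

# toy sizes: 16 views x 21 frames (the paper's output format), 4 tokens, dim 8
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 21, 4, 8))
wq, wk, wv = (rng.standard_normal((8, 8)) * 0.1 for _ in range(3))
y = cross_view_temporal_attention(x, wq, wk, wv)
assert y.shape == x.shape
```

The cost of this joint attention grows with (V·T)², which is one motivation the abstract gives for decoupling layout generation from detail refinement instead.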
Problem

Research questions and friction points this paper is trying to address.

Directly generating high-dimensional 4D data (e.g., dense-view videos) is highly complex
Prior multi-view video methods struggle with cross-view and temporal consistency
Explicit 4D representations (e.g., 4D Gaussians) are difficult to optimize accurately
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cascaded video diffusion model for 4D
Decoupled multi-view layout generation
Structure-aware spatio-temporal generation branch
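The decoupled design above can be sketched as a two-stage dataflow: Stage I lifts the monocular video to a coarse dense-view layout, and Stage II conditions on that layout plus the input video's appearance. This is a minimal sketch under assumed tensor shapes; `stage1_coarse_layout`, `stage2_refine`, and the placeholder `denoise` callables are hypothetical names standing in for the paper's learned diffusion models.

```python
import numpy as np

# Shapes follow the paper's output format: 16 views x 21 frames per object.
V, T, H, W, C = 16, 21, 32, 32, 3

def stage1_coarse_layout(mono_video, denoise):
    """Stage I (sketch): lift a monocular video (T,H,W,C) to a coarse
    dense-view layout (V,T,H,W,C). `denoise` stands in for the first
    diffusion model; here it is just a placeholder callable."""
    layout = np.repeat(mono_video[None], V, axis=0)  # naive broadcast across views
    return denoise(layout)

def stage2_refine(layout, mono_video, denoise):
    """Stage II (sketch): structure-aware conditional generation. The coarse
    layout supplies structure; the input video supplies appearance. Both are
    concatenated along channels as a stand-in for the paper's conditioning."""
    appearance = np.repeat(mono_video[None], V, axis=0)
    cond = np.concatenate([layout, appearance], axis=-1)  # (V,T,H,W,2C)
    return denoise(cond)[..., :C]                         # (V,T,H,W,C)

rng = np.random.default_rng(1)
video = rng.random((T, H, W, C))
identity = lambda z: z  # placeholder for the learned denoisers
coarse = stage1_coarse_layout(video, identity)
final = stage2_refine(coarse, video, identity)
assert final.shape == (V, T, H, W, C)
```

The point of the sketch is the interface, not the models: Stage II never sees raw 3D/temporal structure directly, only the layout prior plus the monocular appearance, which is what makes the explicit 4D representation (e.g., 4D Gaussians) straightforward to fit to the refined dense-view videos.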