AI Summary
To address the poor spatiotemporal consistency and limited high-fidelity 4D generation capability of existing multi-view video diffusion models, this paper introduces SV4D 2.0, the first multi-view video diffusion framework designed specifically for dynamic 3D asset generation. The method's key contributions are a progressive 3D-to-4D training paradigm, a reference-free multi-view dependency modeling mechanism, a joint 3D/frame attention fusion module, and a two-stage 4D consistency optimization coupled with progressive frame sampling. Trained on large-scale multi-view video data, SV4D 2.0 achieves significant improvements: in novel-view video synthesis, it reduces LPIPS by 14% and FV4D by 44%; in 4D optimization, it lowers LPIPS by 12% and FV4D by 24%. These gains markedly improve robustness to occlusion and large motion, as well as generalization to real-world videos.
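The joint 3D/frame attention fusion module mentioned above can be illustrated with a minimal sketch. The paper does not specify the exact fusion rule, so the code below is an assumption: tokens laid out as (views, frames, channels) are passed through self-attention across views (3D attention, per frame) and across frames (temporal attention, per view), then blended with a hypothetical learned scalar weight `alpha`.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product self-attention over the second-to-last axis.
    d = q.shape[-1]
    return softmax(q @ k.swapaxes(-2, -1) / np.sqrt(d)) @ v

def fused_attention(tokens, alpha=0.5):
    """Hypothetical blend of 3D (cross-view) and frame (temporal) attention.

    tokens: array of shape (V, F, D) — V views, F frames, D channels.
    alpha:  assumed learned blending weight in [0, 1].
    """
    V, F, D = tokens.shape
    # 3D attention: for each frame, attend across the V views.
    attn_3d = np.stack(
        [attention(tokens[:, f], tokens[:, f], tokens[:, f]) for f in range(F)],
        axis=1,
    )
    # Frame attention: for each view, attend across the F frames.
    attn_frame = np.stack(
        [attention(tokens[v], tokens[v], tokens[v]) for v in range(V)],
        axis=0,
    )
    # Convex combination of the two attention branches.
    return alpha * attn_3d + (1 - alpha) * attn_frame
```

In practice such a module would operate on latent features inside each diffusion transformer block; the convex combination here stands in for whatever learned fusion SV4D 2.0 actually uses.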
Abstract
We present Stable Video 4D 2.0 (SV4D 2.0), a multi-view video diffusion model for dynamic 3D asset generation. Compared to its predecessor SV4D, SV4D 2.0 is more robust to occlusions and large motion, generalizes better to real-world videos, and produces higher-quality outputs in terms of detail sharpness and spatio-temporal consistency. We achieve this through key improvements in multiple aspects: 1) network architecture: eliminating the dependency on reference multi-views and designing a blending mechanism for 3D and frame attention; 2) data: enhancing the quality and quantity of the training data; 3) training strategy: adopting progressive 3D-to-4D training for better generalization; and 4) 4D optimization: handling 3D inconsistency and large motion via two-stage refinement and progressive frame sampling. Extensive experiments demonstrate significant performance gains for SV4D 2.0, both visually and quantitatively, achieving better detail (-14% LPIPS) and 4D consistency (-44% FV4D) in novel-view video synthesis, and better 4D optimization (-12% LPIPS and -24% FV4D), compared to SV4D. Project page: https://sv4d2.0.github.io.
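The progressive frame sampling used in 4D optimization can be sketched as a coarse-to-fine schedule. The abstract does not give the exact schedule, so the following is an assumption: early stages optimize on sparsely strided keyframes, and each later stage halves the stride until every frame is covered.

```python
def progressive_frame_schedule(num_frames, stages=3):
    """Hypothetical coarse-to-fine frame sampling schedule.

    Returns one list of frame indices per stage: stage 0 uses a large
    stride (few keyframes for stable coarse optimization); each later
    stage halves the stride, ending with all frames.
    """
    schedules = []
    stride = 2 ** (stages - 1)
    for _ in range(stages):
        schedules.append(list(range(0, num_frames, stride)))
        stride = max(1, stride // 2)
    return schedules
```

For an 8-frame clip with 3 stages, this yields frame sets of increasing density, so the optimizer first fits coarse motion on keyframes before refining on all frames.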