AI Summary
This work addresses the challenge of generating complete dynamic 4D scenes from a single-view video, a task hampered by severe information deficiency and cross-view inconsistency. The authors propose an end-to-end framework that first synthesizes synchronized multi-view videos with a diffusion model and then reconstructs an explicit dynamic scene via 4D Gaussian Splatting (4DGS). Key contributions include Real-MV-4D, a large-scale multi-view 4D dataset; a spatio-temporal-view fusion attention mechanism that incorporates geometric priors; and a flow-matching distillation loss that improves novel-view rendering consistency. Experimental results show that the proposed method outperforms existing approaches in both visual fidelity and geometric consistency, achieving high-quality full-view dynamic 4D scene generation for the first time.
Abstract
Generating 4D scenes from a single-view video is inherently ill-posed: a single viewpoint lacks the information needed to recover a complete, dynamic scene with full coverage. Existing methods are typically limited to monocular videos, simple 3D effects, or only small viewpoint perturbations around the original viewpoint, falling short of true 4D generation. Meanwhile, the lack of large-scale datasets capturing full-scope 4D scenes with synchronized multi-view videos further hinders progress in this direction. We propose a novel single-view video-to-4D framework that casts full-scope 4D generation as multi-view video synthesis followed by optimization-based 4D reconstruction from the generated views. To instantiate this formulation end-to-end, we make three key contributions. First, we introduce Real-MV-4D, a large-scale dataset of synchronized multi-view videos captured in diverse real-world environments to provide 4D supervision. Second, we train a multi-view video diffusion model driven by a novel fused time (T)-view (V) attention mechanism that directly embeds geometric reprojection priors and explicit camera conditioning into its view-time interactions. Unlike basic feature fusion, this direct binding strictly aligns the generation process with physical 3D priors to produce a dense, synchronized $T \times V$ video grid. Third, rather than relying on non-interactive and inconsistent 2D video interpolations, we lift the synthesized multi-view videos into an explicit 4D representation (i.e., 4DGS), regularized by a Flow Matching Distillation loss that exploits the multi-view prior to improve novel-view rendering. Extensive experiments demonstrate that our method outperforms existing approaches in both visual fidelity and geometric consistency, enabling full-scope 4D scene generation from single-view videos.
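To make the idea of attention over a $T \times V$ grid concrete, here is a minimal, illustrative sketch in plain Python. It is not the authors' implementation: the abstract does not specify the architecture, so the function name `fused_time_view_attention`, the identity query/key/value projections, the one-vector-per-cell simplification, and the optional `bias` matrix (a hypothetical stand-in for a geometric reprojection prior) are all assumptions. It only shows how every (time, view) cell can attend to every other cell in a single joint attention pass, rather than in separate temporal and view passes.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def fused_time_view_attention(grid, bias=None):
    """Joint self-attention over all (time, view) cells of a T x V grid.

    grid[t][v] is one feature vector (list of floats) for frame t, view v.
    Identity projections are used for queries, keys, and values (assumption).
    bias is an optional (T*V) x (T*V) additive logit matrix, a hypothetical
    placeholder for a geometric prior; it is not taken from the paper.
    """
    T, V = len(grid), len(grid[0])
    # Flatten the grid so that token index = t * V + v.
    tokens = [grid[t][v] for t in range(T) for v in range(V)]
    n, d = len(tokens), len(tokens[0])

    weights = []
    for i in range(n):
        # Scaled dot-product logits between cell i and every cell j.
        logits = [
            sum(a * b for a, b in zip(tokens[i], tokens[j])) / math.sqrt(d)
            for j in range(n)
        ]
        if bias is not None:
            logits = [l + bias[i][j] for j, l in enumerate(logits)]
        weights.append(softmax(logits))

    # Each output token is a weighted mix over all times and views.
    out = [
        [sum(weights[i][j] * tokens[j][c] for j in range(n)) for c in range(d)]
        for i in range(n)
    ]
    # Restore the T x V layout.
    fused = [[out[t * V + v] for v in range(V)] for t in range(T)]
    return fused, weights


random.seed(0)
T, V, D = 4, 3, 8  # toy sizes chosen for illustration
grid = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(V)] for _ in range(T)]
fused, weights = fused_time_view_attention(grid)
```

In a real model each grid cell would hold many spatial tokens and the projections would be learned; the point of the sketch is only that flattening time and view into one token axis lets a single attention map couple both dimensions, which is one plausible reading of "fused" attention.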