SS4D: Native 4D Generative Model via Structured Spacetime Latents

📅 2025-12-16
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work introduces the first native 4D generative model for synthesizing high-fidelity, temporally coherent, and structurally consistent dynamic 3D objects directly from monocular video, bypassing conventional multi-stage optimization or fine-tuning of pre-trained 3D/video models. Methodologically, it proposes a structured spatiotemporal latent representation that integrates pre-trained image-to-3D priors, dedicated temporal neural layers, and factorized 4D convolutions, augmented with temporal downsampling blocks and occlusion-robust training strategies. The framework enables end-to-end spatiotemporal consistency modeling under sparse 4D supervision, significantly improving geometric stability and motion coherence. Moreover, it supports efficient long-sequence training and inference. Experimental results demonstrate state-of-the-art performance in dynamic 3D reconstruction and novel-view synthesis from monocular video, establishing a new paradigm for monocular dynamic 3D generation.

๐Ÿ“ Abstract
We present SS4D, a native 4D generative model that synthesizes dynamic 3D objects directly from monocular video. Unlike prior approaches that construct 4D representations by optimizing over 3D or video generative models, we train a generator directly on 4D data, achieving high fidelity, temporal coherence, and structural consistency. At the core of our method is a compressed set of structured spacetime latents. Specifically: (1) to address the scarcity of 4D training data, we build on a pre-trained single-image-to-3D model, preserving strong spatial consistency; (2) temporal consistency is enforced by introducing dedicated temporal layers that reason across frames; (3) to support efficient training and inference over long video sequences, we compress the latent sequence along the temporal axis using factorized 4D convolutions and temporal downsampling blocks. In addition, we employ a carefully designed training strategy to enhance robustness against occlusion.
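To make the efficiency argument behind factorized 4D convolutions concrete, here is a minimal sketch comparing parameter counts for a dense kernel spanning all four axes against a factorized spatial-then-temporal pair. The channel widths and kernel size are assumed for illustration; the paper does not specify its layer configurations, and this is not the authors' code.

```python
# Illustrative parameter-count comparison (assumed layer sizes, not SS4D's actual config).

def dense_4d_params(c_in, c_out, k):
    # One kernel spanning x, y, z, and t: k**4 weights per channel pair.
    return c_in * c_out * k ** 4

def factorized_4d_params(c_in, c_out, k):
    # Spatial 3D conv (k**3 weights) followed by a temporal 1D conv (k weights).
    spatial = c_in * c_out * k ** 3
    temporal = c_out * c_out * k
    return spatial + temporal

c_in, c_out, k = 64, 64, 3  # hypothetical layer configuration
dense = dense_4d_params(c_in, c_out, k)
fact = factorized_4d_params(c_in, c_out, k)
print(dense, fact, round(dense / fact, 2))  # 331776 122880 2.7
```

The same factorization also cuts the per-position multiply-accumulate cost, which is what makes training over long latent sequences tractable.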
Problem

Research questions and friction points this paper is trying to address.

Generates dynamic 3D objects from monocular video
Ensures temporal coherence and structural consistency in 4D generation
Addresses 4D data scarcity and enables efficient long-sequence training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Direct 4D generator training for dynamic 3D synthesis
Structured spacetime latents with temporal layers for consistency
Factorized 4D convolutions for efficient long-sequence processing
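The temporal compression the bullets above refer to can be sketched as a strided pooling step over the per-frame latents: each downsampling block halves the sequence length, so the cost of every layer downstream shrinks accordingly. SS4D's actual blocks are learned; the stride-2 mean pool below is a hypothetical stand-in for illustration only.

```python
# Hypothetical temporal-downsampling sketch (not the paper's learned blocks):
# average non-overlapping windows of frame latents to halve sequence length.

def temporal_downsample(latents, stride=2):
    # latents: list of per-frame latent vectors (lists of floats), all same dim.
    out = []
    for i in range(0, len(latents) - stride + 1, stride):
        window = latents[i:i + stride]
        # Element-wise mean across the window of frames.
        pooled = [sum(vals) / stride for vals in zip(*window)]
        out.append(pooled)
    return out

seq = [[float(t), 2.0 * t] for t in range(8)]  # 8 frame latents of dim 2
compressed = temporal_downsample(seq)
print(len(seq), "->", len(compressed))  # 8 -> 4
```

Stacking such blocks gives geometric reduction in sequence length, which is how a single model can train and infer over long monocular videos.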