Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models

πŸ“… 2025-11-01
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the end-to-end generation of controllable, explicit 4D scenes (appearance, geometry, and motion) from a single input image, without test-time optimization or post-processing. We propose the first feed-forward framework that unifies video diffusion priors with geometric and motion constraints learned from large-scale 4D data, directly predicting time-varying 3D Gaussian fields via a video latent-space transformer. Our method jointly models spatiotemporal dependencies and optimizes appearance fidelity, geometric accuracy, and motion consistency under a unified loss. A single inference pass produces a high-fidelity 4D scene in roughly 30 seconds. The method matches or surpasses optimization-based state-of-the-art approaches on video generation, novel-view synthesis, and dynamic geometry reconstruction, while being significantly more efficient and controllable for dynamic scene generation.
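To make the prediction target concrete, here is a minimal, hypothetical sketch of how a per-frame head might map video latent tokens to time-varying 3D Gaussian parameters. The class name, latent dimension, token shapes, and the exact parameterization (position, scale, quaternion rotation, color, opacity) are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch (not the authors' code): a feed-forward head that turns
# per-frame video latent tokens into 3D Gaussian parameters. All names,
# shapes, and the 14-channel parameterization are assumptions.
import torch
import torch.nn as nn

class GaussianFieldHead(nn.Module):
    """Maps per-frame video latent tokens to time-varying 3D Gaussians."""

    # Per Gaussian: 3 (position) + 3 (scale) + 4 (rotation quaternion)
    # + 3 (RGB color) + 1 (opacity) = 14 channels.
    PARAMS_PER_GAUSSIAN = 14

    def __init__(self, latent_dim: int = 768, gaussians_per_token: int = 2):
        super().__init__()
        self.proj = nn.Linear(latent_dim,
                              gaussians_per_token * self.PARAMS_PER_GAUSSIAN)

    def forward(self, tokens: torch.Tensor) -> dict:
        # tokens: (batch, frames, num_tokens, latent_dim) from the
        # video latent transformer backbone.
        b, t, n, _ = tokens.shape
        params = self.proj(tokens).reshape(b, t, -1, self.PARAMS_PER_GAUSSIAN)
        return {
            "position": params[..., 0:3],            # world-space means
            "scale":    params[..., 3:6].exp(),      # strictly positive scales
            "rotation": nn.functional.normalize(params[..., 6:10], dim=-1),
            "color":    params[..., 10:13].sigmoid(),  # RGB in [0, 1]
            "opacity":  params[..., 13:14].sigmoid(),  # alpha in [0, 1]
        }
```

Because the head emits a full parameter set per frame, the sequence of outputs forms a deformable Gaussian field; motion falls out of how positions and rotations vary across the frame axis rather than from any test-time optimization.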

πŸ“ Abstract
We introduce Diff4Splat, a feed-forward method that synthesizes controllable and explicit 4D scenes from a single image. Our approach unifies the generative priors of video diffusion models with geometry and motion constraints learned from large-scale 4D datasets. Given a single input image, a camera trajectory, and an optional text prompt, Diff4Splat directly predicts a deformable 3D Gaussian field that encodes appearance, geometry, and motion, all in a single forward pass, without test-time optimization or post-hoc refinement. At the core of our framework lies a video latent transformer, which augments video diffusion models to jointly capture spatio-temporal dependencies and predict time-varying 3D Gaussian primitives. Training is guided by objectives on appearance fidelity, geometric accuracy, and motion consistency, enabling Diff4Splat to synthesize high-quality 4D scenes in 30 seconds. We demonstrate the effectiveness of Diff4Splat across video generation, novel view synthesis, and geometry extraction, where it matches or surpasses optimization-based methods for dynamic scene synthesis while being significantly more efficient.
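The abstract names three training objectives: appearance fidelity, geometric accuracy, and motion consistency. The sketch below shows one plausible way such a combined loss could be assembled; the supervision signals (rendered frames vs. ground-truth RGB, depth, and flow), the L1 form of each term, and the weights are assumptions, since the paper's exact formulation is not given here.

```python
# Illustrative sketch of a unified training objective in the spirit of the
# abstract. Term definitions, targets, and weights are assumptions, not the
# authors' published loss.
import torch.nn.functional as F

def unified_4d_loss(rendered, target_rgb,
                    pred_depth, target_depth,
                    pred_flow, target_flow,
                    w_app=1.0, w_geo=0.5, w_motion=0.5):
    # Appearance fidelity: photometric error on frames rendered from the
    # predicted Gaussian field.
    l_app = F.l1_loss(rendered, target_rgb)
    # Geometric accuracy: depth supervision from large-scale 4D data.
    l_geo = F.l1_loss(pred_depth, target_depth)
    # Motion consistency: 2D flow induced by Gaussian motion vs. a
    # reference flow field.
    l_motion = F.l1_loss(pred_flow, target_flow)
    return w_app * l_app + w_geo * l_geo + w_motion * l_motion
```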
Problem

Research questions and friction points this paper is trying to address.

Generating controllable 4D scenes from a single input image
Unifying video diffusion priors with geometric and motion constraints
Predicting deformable 3D Gaussian fields without test-time optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Feed-forward synthesis of controllable, explicit 4D scenes
Unifies video diffusion models with geometry and motion constraints
Predicts a deformable 3D Gaussian field in a single forward pass