Controlling Space and Time with Diffusion Models

📅 2024-07-10
🏛️ arXiv.org
📈 Citations: 25
Influential: 6
🤖 AI Summary
This work addresses the problem of 4D novel view synthesis (NVS) under arbitrary camera trajectories and timestamps, conditioned on single or multiple input images of natural scenes. To this end, we propose 4DiM—the first general-purpose 4D generative framework based on a cascaded diffusion architecture—breaking away from restrictive object-centric assumptions to support complex, open-world scenes. Methodologically, we introduce: (i) a calibration pipeline for structure-from-motion-posed data enabling metric-scale pose control; (ii) a hybrid co-training paradigm integrating pose, time, and video modalities; and (iii) a conditional sampling mechanism for fine-grained spatiotemporal control. Experiments demonstrate that 4DiM significantly outperforms existing 3D NVS methods in both image fidelity and pose alignment accuracy. Moreover, it unifies diverse tasks—including single-image 3D generation, two-frame video interpolation/extrapolation, and pose-driven video-to-video translation—within a single framework.

📝 Abstract
We present 4DiM, a cascaded diffusion model for 4D novel view synthesis (NVS), supporting generation with arbitrary camera trajectories and timestamps, in natural scenes, conditioned on one or more images. With a novel architecture and sampling procedure, we enable training on a mixture of 3D (with camera pose), 4D (pose+time) and video (time but no pose) data, which greatly improves generalization to unseen images and camera pose trajectories over prior works that focus on limited domains (e.g., object centric). 4DiM is the first-ever NVS method with intuitive metric-scale camera pose control enabled by our novel calibration pipeline for structure-from-motion-posed data. Experiments demonstrate that 4DiM outperforms prior 3D NVS models both in terms of image fidelity and pose alignment, while also enabling the generation of scene dynamics. 4DiM provides a general framework for a variety of tasks including single-image-to-3D, two-image-to-video (interpolation and extrapolation), and pose-conditioned video-to-video translation, which we illustrate qualitatively on a variety of scenes. For an overview see https://4d-diffusion.github.io
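The abstract credits metric-scale camera pose control to a calibration pipeline for structure-from-motion-posed data. The paper does not spell out the procedure here, but the core idea can be sketched: SfM reconstructions have an arbitrary global scale, so aligning SfM depths against metric depth estimates (e.g. from a monocular depth model) yields a scale factor for the camera translations. The function name and the median-ratio estimator below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def metric_rescale(sfm_translations, sfm_depths, metric_depths):
    """Rescale SfM camera translations to metric scale.

    Hypothetical sketch: estimate one global scale as the robust
    (median) ratio of metric depth to SfM depth over matched points,
    then apply it to the camera translations.
    """
    ratios = metric_depths / np.maximum(sfm_depths, 1e-8)
    scale = float(np.median(ratios))  # robust to outlier depth estimates
    return sfm_translations * scale, scale
```

With a consistent scale recovered this way, pose conditioning can be expressed in real-world units, which is what makes "move the camera 1 meter forward" a meaningful control signal.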
Problem

Research questions and friction points this paper is trying to address.

How to perform 4D novel view synthesis under arbitrary camera trajectories and timestamps
How to generalize beyond limited, object-centric domains using mixed 3D, 4D, and video training data
How to provide intuitive metric-scale camera pose control for dynamic scenes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cascaded diffusion model for 4D view synthesis
Training on mixed 3D, 4D, and video data
Metric-scale camera pose control calibration
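The mixed-data training listed above hinges on handling samples with different conditioning signals: 3D data has pose but no time, video has time but no pose, and 4D data has both. A common way to realize this (an assumption here, not the paper's published code) is to zero-fill missing modalities and pass an explicit availability mask alongside each signal:

```python
import numpy as np

def build_conditioning(frames, poses=None, times=None):
    """Assemble per-frame conditioning for mixed-modality co-training.

    Illustrative sketch: missing modalities are zero-filled and flagged
    with a binary mask, so one model can consume 3D (pose-only),
    video (time-only), and 4D (pose + time) batches uniformly.
    The 7-dim pose encoding (translation + quaternion) is an assumption.
    """
    n = len(frames)
    pose_cond = np.zeros((n, 7), dtype=np.float32)
    time_cond = np.zeros((n, 1), dtype=np.float32)
    pose_mask = np.zeros((n, 1), dtype=np.float32)
    time_mask = np.zeros((n, 1), dtype=np.float32)
    if poses is not None:
        pose_cond[:] = poses
        pose_mask[:] = 1.0
    if times is not None:
        time_cond[:, 0] = times
        time_mask[:] = 1.0
    # Concatenate signal and mask so the network can tell "zero pose"
    # apart from "pose unavailable".
    return np.concatenate([pose_cond, pose_mask, time_cond, time_mask], axis=1)
```

At sampling time the same masks double as a control interface: supplying only poses yields pose-conditioned 3D generation, only times yields video interpolation/extrapolation, and both together yields full 4D control.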