Can Video Diffusion Model Reconstruct 4D Geometry?

📅 2025-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the challenging problem of reconstructing dynamic 3D scenes (i.e., 4D geometry) from monocular video. The authors propose Sora3R, a framework that transfers spatiotemporal latent priors from large-scale video diffusion models to 4D geometric reconstruction. Sora3R generates temporally consistent 4D pointmaps via a two-stage feedforward pipeline, operating solely on monocular video input without auxiliary supervision (e.g., depth, optical flow, or segmentation), iterative optimization, or global trajectory alignment. Key technical components: (1) adaptation of a pointmap VAE from a pretrained video VAE; (2) finetuning of a diffusion backbone in the combined video-pointmap latent space; (3) fully feedforward inference. Extensive experiments demonstrate reliable recovery of camera poses and detailed scene geometry across diverse dynamic scenes, with performance on par with state-of-the-art methods.

📝 Abstract
Reconstructing dynamic 3D scenes (i.e., 4D geometry) from monocular video is an important yet challenging problem. Conventional multiview geometry-based approaches often struggle with dynamic motion, whereas recent learning-based methods either require specialized 4D representation or sophisticated optimization. In this paper, we present Sora3R, a novel framework that taps into the rich spatiotemporal priors of large-scale video diffusion models to directly infer 4D pointmaps from casual videos. Sora3R follows a two-stage pipeline: (1) we adapt a pointmap VAE from a pretrained video VAE, ensuring compatibility between the geometry and video latent spaces; (2) we finetune a diffusion backbone in combined video and pointmap latent space to generate coherent 4D pointmaps for every frame. Sora3R operates in a fully feedforward manner, requiring no external modules (e.g., depth, optical flow, or segmentation) or iterative global alignment. Extensive experiments demonstrate that Sora3R reliably recovers both camera poses and detailed scene geometry, achieving performance on par with state-of-the-art methods for dynamic 4D reconstruction across diverse scenarios.
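The two-stage pipeline in the abstract can be sketched as a toy feedforward loop. This is a minimal illustration with hypothetical shapes and stand-in modules (`video_vae_encode`, `pointmap_vae_decode`, `denoise_step` are all placeholders, not the paper's actual components): video latents from a pretrained video VAE condition the denoising of pointmap latents in a combined latent space, and the adapted pointmap VAE decodes per-frame XYZ maps, with no per-scene optimization or global alignment.

```python
import numpy as np

rng = np.random.default_rng(0)

T, H, W, C = 8, 16, 16, 3  # frames, height, width, channels (toy sizes)
D = 4                      # latent channels per VAE (hypothetical)

def video_vae_encode(frames):
    """Stand-in for the pretrained video VAE encoder: frames -> latents."""
    return frames.reshape(T, -1)[:, :D]  # toy projection, not a real encoder

def pointmap_vae_decode(latents):
    """Stand-in for the adapted pointmap VAE decoder: latents -> XYZ maps."""
    return np.tile(latents[:, :3, None, None], (1, 1, H, W))

def denoise_step(z, t):
    """Stand-in for one step of the finetuned diffusion backbone
    acting on the combined video+pointmap latent."""
    return z * 0.9  # toy contraction standing in for learned denoising

# Stage 1 output (frozen at inference): video latents that condition the model.
video_latents = video_vae_encode(rng.standard_normal((T, H, W, C)))

# Stage 2: denoise pointmap latents jointly with the video latents,
# fully feedforward, with no iterative global alignment.
pointmap_latents = rng.standard_normal((T, D))
for t in range(10):
    combined = np.concatenate([video_latents, pointmap_latents], axis=-1)
    pointmap_latents = denoise_step(combined, t)[:, D:]

pointmaps = pointmap_vae_decode(pointmap_latents)  # (T, 3, H, W) XYZ per frame
print(pointmaps.shape)
```

The key design point this mirrors is that geometry is predicted in the same latent space as video, so the pretrained video prior can be reused rather than learning a 4D representation from scratch.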
Problem

Research questions and friction points this paper is trying to address.

Reconstructing 4D geometry (dynamic 3D scenes) from monocular video
Handling dynamic motion, where conventional multiview geometry methods struggle
Inferring 4D pointmaps without iterative optimization or global alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses video diffusion models for 4D reconstruction
Two-stage pipeline with pointmap VAE adaptation
Feedforward processing without external modules
👥 Authors
Jinjie Mai
KAUST
3D Vision
Wenxuan Zhu
MS/PhD KAUST
Haozhe Liu
KAUST
Computer Vision · Reinforcement Learning · Multimodal · Image/Video Generation
Bing Li
King Abdullah University of Science and Technology (KAUST)
Cheng Zheng
King Abdullah University of Science and Technology (KAUST)
Jürgen Schmidhuber
King Abdullah University of Science and Technology (KAUST)
Bernard Ghanem
Professor, King Abdullah University of Science and Technology
Computer Vision · Machine Learning