Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction

📅 2025-04-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses 4D geometric reconstruction of dynamic scenes from monocular video. It proposes Geo4D, a method trained exclusively on synthetic data that transfers zero-shot to real-world videos, with no need for real-world ground-truth annotations. The approach repurposes a pre-trained video diffusion model and relies on the strong dynamic prior that such a model has already captured. The model predicts three complementary geometric modalities (point maps, depth maps, and ray maps), and a multi-modal alignment algorithm aligns and fuses them at inference time. The same inference-time step also fuses multiple sliding windows, which enables robust reconstruction of long videos. Across multiple benchmarks, the method significantly outperforms state-of-the-art video depth estimation methods, including MonST3R, which is also designed to handle dynamic scenes.

📝 Abstract
We introduce Geo4D, a method to repurpose video diffusion models for monocular 3D reconstruction of dynamic scenes. By leveraging the strong dynamic prior captured by such video models, Geo4D can be trained using only synthetic data while generalizing well to real data in a zero-shot manner. Geo4D predicts several complementary geometric modalities, namely point, depth, and ray maps. It uses a new multi-modal alignment algorithm to align and fuse these modalities, as well as multiple sliding windows, at inference time, thus obtaining robust and accurate 4D reconstruction of long videos. Extensive experiments across multiple benchmarks show that Geo4D significantly surpasses state-of-the-art video depth estimation methods, including recent methods such as MonST3R, which are also designed to handle dynamic scenes.
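
To make the relationship between the three modalities concrete, here is a minimal sketch, assuming a simple pinhole camera with known intrinsics. It is not Geo4D's implementation: the function names (`ray_directions`, `depth_to_points`, `modality_disagreement`) and the camera-frame, z-normalised ray parameterisation are illustrative assumptions. It only shows why a depth map and a ray map together determine a point map, and therefore why the three predictions can be checked against one another.

```python
import numpy as np

def ray_directions(height, width, fx, fy, cx, cy):
    """Per-pixel ray directions in the camera frame, scaled so that z == 1.

    Assumes a pinhole camera; the intrinsics are hypothetical inputs.
    """
    u, v = np.meshgrid(np.arange(width, dtype=np.float64),
                       np.arange(height, dtype=np.float64))
    x = (u - cx) / fx
    y = (v - cy) / fy
    return np.stack([x, y, np.ones_like(x)], axis=-1)  # shape (H, W, 3)

def depth_to_points(depth, rays):
    """Point map implied by z-depth and z-normalised rays: P = depth * ray."""
    return depth[..., None] * rays

def modality_disagreement(points, depth, rays):
    """Mean Euclidean gap between a predicted point map and the point map
    implied by the depth and ray maps (0 means perfectly consistent)."""
    implied = depth_to_points(depth, rays)
    return float(np.mean(np.linalg.norm(points - implied, axis=-1)))

# Toy usage: a fronto-parallel plane two units from the camera.
rays = ray_directions(4, 6, fx=100.0, fy=100.0, cx=3.0, cy=2.0)
depth = np.full((4, 6), 2.0)
points = depth_to_points(depth, rays)
```

A disagreement measure of this kind is one plausible quantity for an alignment step to minimise; the abstract does not specify the objective Geo4D actually uses.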

Problem

Research questions and friction points this paper is trying to address.

Monocular 3D reconstruction of dynamic scenes
Leveraging video diffusion models for geometric 4D reconstruction
Aligning and fusing multi-modal geometric data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Repurposing video diffusion for 3D reconstruction
Multi-modal alignment for geometric fusion
Sliding windows for robust 4D reconstruction
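
The sliding-window idea can be illustrated with a generic sketch, assuming each window yields a depth sequence that is correct only up to an unknown scale and shift. This is not the paper's alignment algorithm, which is not detailed here: `fit_scale_shift` and `fuse_windows` are hypothetical helpers that use ordinary least squares on the overlapping frames and a plain average for blending.

```python
import numpy as np

def fit_scale_shift(src, ref):
    """Least-squares scale s and shift t minimising ||s * src + t - ref||^2."""
    design = np.stack([src.ravel(), np.ones(src.size)], axis=1)
    (s, t), *_ = np.linalg.lstsq(design, ref.ravel(), rcond=None)
    return s, t

def fuse_windows(win_a, win_b, overlap):
    """Stitch two depth windows of shape (T, H, W).

    The last `overlap` frames of win_a and the first `overlap` frames of
    win_b are assumed to show the same video frames; overlap must be >= 1.
    """
    # Bring window B into window A's scale using the shared frames.
    s, t = fit_scale_shift(win_b[:overlap], win_a[-overlap:])
    aligned_b = s * win_b + t
    # Average the two estimates on the shared frames (an illustrative choice).
    blended = 0.5 * (win_a[-overlap:] + aligned_b[:overlap])
    return np.concatenate([win_a[:-overlap], blended, aligned_b[overlap:]], axis=0)

# Toy usage: window B sees the same scene with a different scale and shift.
rng = np.random.default_rng(0)
truth = rng.random((6, 2, 3)) + 1.0
win_a = truth[:4]
win_b = (truth[2:] - 0.5) / 2.0
fused = fuse_windows(win_a, win_b, overlap=2)
```

Longer videos would repeat this step window by window; a joint optimisation over all windows is another option, at higher cost.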