🤖 AI Summary
This paper addresses monocular, video-driven 4D geometric reconstruction of dynamic scenes. The authors propose a zero-shot method trained exclusively on synthetic data, without requiring real-world ground-truth annotations. The approach leverages a pre-trained video diffusion model as a dynamic prior to implicitly enforce temporal geometric consistency, which the authors present as a first in this domain. A multimodal geometric prediction head jointly outputs point maps, depth maps, and ray maps, with a cross-modal geometric alignment loss ensuring structural coherence across modalities. A sliding-window fusion mechanism additionally enables robust reconstruction of long videos. Evaluated on multiple benchmarks, the method significantly outperforms state-of-the-art approaches such as MonST3R, achieving high-fidelity, generalizable monocular 4D reconstruction. Notably, it demonstrates strong zero-shot transfer to real-world videos, confirming its practical applicability and robustness.
📝 Abstract
We introduce Geo4D, a method to repurpose video diffusion models for monocular 3D reconstruction of dynamic scenes. By leveraging the strong dynamic prior captured by such video models, Geo4D can be trained using only synthetic data while generalizing well to real data in a zero-shot manner. Geo4D predicts several complementary geometric modalities, namely point, depth, and ray maps. At inference time, it uses a new multi-modal alignment algorithm to align and fuse these modalities, as well as multiple sliding windows, thus obtaining robust and accurate 4D reconstructions of long videos. Extensive experiments across multiple benchmarks show that Geo4D significantly surpasses state-of-the-art video depth estimation methods, including recent methods such as MonST3R, which are also designed to handle dynamic scenes.
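The sliding-window idea described above can be illustrated with a minimal sketch: overlapping windows of per-frame predictions are aligned to the already-fused result via a least-squares scale-and-shift on their overlapping frames, then averaged. This is an assumption-laden illustration, not Geo4D's actual algorithm or API; the fixed stride, the scale-shift alignment model, and all function names here are hypothetical.

```python
import numpy as np

def align_scale_shift(src, ref):
    """Least-squares scale s and shift t so that s*src + t ~= ref.
    (Hypothetical alignment model, used here for illustration only.)"""
    s_flat, r_flat = src.ravel(), ref.ravel()
    sc, rc = s_flat - s_flat.mean(), r_flat - r_flat.mean()
    s = np.dot(sc, rc) / (np.dot(sc, sc) + 1e-8)
    t = r_flat.mean() - s * s_flat.mean()
    return s, t

def fuse_sliding_windows(windows, stride, num_frames):
    """Fuse overlapping per-window predictions into one sequence.

    windows: list of arrays, each of shape (W, H, Wimg, C), where
             windows[i] covers frames [i*stride, i*stride + W).
             Assumes the windows jointly cover all num_frames frames.
    """
    out = None
    counts = np.zeros(num_frames)
    for i, win in enumerate(windows):
        start, W = i * stride, win.shape[0]
        if out is None:
            out = np.zeros((num_frames,) + win.shape[1:])
        else:
            # Align this window to the already-fused overlapping frames.
            overlap = [f for f in range(start, start + W) if counts[f] > 0]
            if overlap:
                ref = np.stack([out[f] / counts[f] for f in overlap])
                src = win[[f - start for f in overlap]]
                s, t = align_scale_shift(src, ref)
                win = s * win + t
        # Accumulate the (aligned) window; overlaps are averaged at the end.
        out[start:start + W] += win
        counts[start:start + W] += 1
    return out / counts[:, None, None, None]
```

Averaging the aligned overlaps keeps all windows in the coordinate frame of the first one, so long sequences stay mutually consistent even though each window is predicted independently.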