GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors

📅 2025-04-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video depth estimation methods produce affine-invariant predictions, which are correct only up to an unknown scale and shift. This compromises geometric fidelity and limits their use in metrically grounded tasks such as 3D/4D reconstruction and camera parameter estimation. GeometryCrafter addresses this by recovering high-fidelity, temporally coherent point map sequences from open-world videos. It introduces a point map VAE whose latent space is agnostic to video latent distributions, and uses it to train a video diffusion model that models the distribution of point map sequences conditioned on the input video. Across diverse datasets, the method reports state-of-the-art 3D accuracy, temporal consistency, and generalization capability.

📝 Abstract

Despite remarkable advancements in video depth estimation, existing methods are inherently limited in geometric fidelity by their affine-invariant predictions, which restricts their applicability in reconstruction and other metrically grounded downstream tasks. We propose GeometryCrafter, a novel framework that recovers high-fidelity point map sequences with temporal coherence from open-world videos, enabling accurate 3D/4D reconstruction, camera parameter estimation, and other depth-based applications. At the core of our approach lies a point map Variational Autoencoder (VAE) that learns a latent space agnostic to video latent distributions for effective point map encoding and decoding. Leveraging the VAE, we train a video diffusion model to capture the distribution of point map sequences conditioned on the input videos. Extensive evaluations on diverse datasets demonstrate that GeometryCrafter achieves state-of-the-art 3D accuracy, temporal consistency, and generalization capability.
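
To make the "affine-invariant" limitation concrete, the sketch below shows what such a prediction looks like numerically: it matches the true depth only after an unknown scale and shift are recovered. This is an illustration of the general concept, not code from the paper. The function name `align_scale_shift` and the synthetic data are assumptions made for the example, and the least-squares fit is a common evaluation-time alignment, not GeometryCrafter's method.

```python
import numpy as np

def align_scale_shift(pred, target):
    """Least-squares fit of scale s and shift t so that s * pred + t ~= target."""
    # Design matrix with one column for the prediction and one for the constant term.
    A = np.stack([pred.ravel(), np.ones(pred.size)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, target.ravel(), rcond=None)
    return s, t

# Synthetic metric depth map (hypothetical data, in arbitrary length units).
rng = np.random.default_rng(0)
metric_depth = rng.uniform(1.0, 10.0, size=(4, 6))

# A hypothetical affine-invariant prediction: right structure, wrong scale and shift.
pred = (metric_depth - 0.5) / 2.0

# Without ground truth, s and t are unknown, so metric quantities cannot be read off pred.
s, t = align_scale_shift(pred, metric_depth)
aligned = s * pred + t
```

Because the scale and shift can only be recovered with reference depth, a prediction of this kind cannot be unprojected directly into a metrically meaningful 3D scene, which is the gap the abstract describes.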

Problem

Research questions and friction points this paper is trying to address.

Affine-invariant predictions limit geometric fidelity in video depth estimation

Accurate 3D/4D reconstruction and camera parameter estimation from open-world videos

Temporal consistency and generalization in video geometry estimation

Innovation

Methods, ideas, or system contributions that make the work stand out.

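Uses a point map VAE whose latent space is agnostic to video latent distributions

Trains a video diffusion model on point map sequences conditioned on the input video

Achieves state-of-the-art 3D accuracy, temporal consistency, and generalization across diverse datasets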
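
For readers new to the term, a point map is commonly understood as an image-shaped array that stores a 3D coordinate for every pixel. The sketch below builds one from a depth map under an assumed pinhole camera. The function `depth_to_point_map` and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical names chosen for illustration. The page does not say how GeometryCrafter parameterizes its point maps, so this shows only the data structure, not the paper's pipeline.

```python
import numpy as np

def depth_to_point_map(depth, fx, fy, cx, cy):
    """Unproject an (H, W) depth map into an (H, W, 3) point map (pinhole model)."""
    h, w = depth.shape
    # u indexes columns (x direction), v indexes rows (y direction).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)

# Toy example: a flat surface 2 units in front of the camera.
depth = np.full((2, 3), 2.0)
points = depth_to_point_map(depth, fx=1.0, fy=1.0, cx=1.0, cy=0.5)
```

A video then corresponds to a sequence of such arrays, one per frame, which is the kind of output the listed contributions refer to.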

🔎 Similar Papers

DiffMesh: A Motion-aware Diffusion Framework for Human Mesh Recovery from Videos