🤖 AI Summary
This work proposes a four-dimensional reconstruction framework based on video diffusion models to jointly recover dense 3D geometry and scene motion from monocular videos. The method introduces a unified representation of dense 3D point maps and 3D scene flow within a shared coordinate system and designs a novel 4D variational autoencoder (VAE) for end-to-end learning. A key innovation lies in abandoning the conventional strategy of enforcing alignment between RGB and 3D latent spaces; instead, it employs a normalization scheme and VAE training mechanism specifically tailored for 4D data, effectively transferring diffusion priors. Experiments demonstrate that the approach achieves state-of-the-art performance across multiple benchmarks, improving geometric reconstruction accuracy by 38.64% and motion estimation by 25.0%, all without requiring post-optimization.
📝 Abstract
We introduce MotionCrafter, a video diffusion-based framework that jointly reconstructs 4D geometry and estimates dense motion from a monocular video. The core of our method is a novel joint representation of dense 3D point maps and 3D scene flows in a shared coordinate system, together with a novel 4D VAE that learns this representation effectively. Unlike prior work that forces the 3D values and latents to align strictly with RGB VAE latents, despite their fundamentally different distributions, we show that such alignment is unnecessary and leads to suboptimal performance. Instead, we introduce a new data normalization and VAE training strategy that better transfers diffusion priors and greatly improves reconstruction quality. Extensive experiments across multiple datasets demonstrate that MotionCrafter achieves state-of-the-art performance in both geometry reconstruction and dense scene flow estimation, delivering improvements of 38.64% in geometry reconstruction and 25.0% in motion reconstruction, all without any post-optimization. Project page: https://ruijiezhu94.github.io/MotionCrafter_Page
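To make the idea of a joint point-map and scene-flow representation more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that each frame has a point map of shape (H, W, 3) expressed in one shared coordinate system, and that scene flow is the per-pixel 3D displacement to the next frame. The function names (`build_joint_representation`, `normalize_joint`, `warp_to_next_frame`) are hypothetical, and the mean-distance normalization is only a simple stand-in: the abstract does not specify the actual normalization scheme.

```python
import numpy as np


def build_joint_representation(points, flow):
    """Stack point maps and scene flow into one (T, H, W, 6) array.

    Assumption: `points` and `flow` are both (T, H, W, 3) and live in the
    same (shared) coordinate system.
    """
    if points.shape != flow.shape or points.shape[-1] != 3:
        raise ValueError("points and flow must both have shape (T, H, W, 3)")
    return np.concatenate([points, flow], axis=-1)


def normalize_joint(joint, eps=1e-8):
    """Hypothetical normalization: center the points, then divide points and
    flow by the same scalar so that displacements stay consistent."""
    points, flow = joint[..., :3], joint[..., 3:]
    center = points.reshape(-1, 3).mean(axis=0)
    scale = np.linalg.norm(points - center, axis=-1).mean() + eps
    norm_points = (points - center) / scale  # translation and scale removed
    norm_flow = flow / scale                 # displacements only need scaling
    return np.concatenate([norm_points, norm_flow], axis=-1), center, scale


def warp_to_next_frame(joint):
    """Move every 3D point along its flow vector (assumed flow convention)."""
    return joint[..., :3] + joint[..., 3:]
```

Because the points and the flow share one scale factor in this sketch, warping the normalized representation gives the same result as normalizing the warped points, which is one reason a shared coordinate system is convenient for a joint representation.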