Shape of Motion: 4D Reconstruction from a Single Video

📅 2024-07-18
🏛️ arXiv.org
📈 Citations: 135
✨ Influential: 36
🤖 AI Summary
This paper addresses the highly ill-posed problem of 4D reconstruction of dynamic scenes from unconstrained monocular video, i.e., recovering temporally coherent 3D geometry with both structural completeness and motion consistency. The authors propose a framework that models explicit SE(3) scene motion without relying on templates or quasi-static-scene assumptions. The key contributions are: (1) an SE(3) motion-basis representation that enables a soft decomposition of the scene into rigidly moving groups and supports long-range motion modeling; (2) a globally consistent joint optimization that consolidates noisy data-driven priors, including monocular depth maps and long-range 2D tracks; and (3) differentiable rendering combined with trajectory constraints to improve geometric and motion fidelity. The method achieves state-of-the-art performance on long-range 3D/2D motion estimation and novel-view synthesis, significantly improving motion continuity and reconstruction accuracy.
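To make the motion-basis idea concrete, below is a minimal sketch of how a point's motion can be expressed as a soft linear combination of SE(3) bases, in the spirit of linear blend skinning. All names and array shapes here are illustrative assumptions, not the paper's implementation; in particular, the paper may blend the transforms themselves rather than the transformed points.

```python
import numpy as np

def blend_se3_motion(points, weights, rotations, translations):
    """Move canonical 3D points to a target frame by softly blending
    B rigid SE(3) motion bases (linear-blend-skinning-style sketch).

    points:       (N, 3)    canonical 3D point positions
    weights:      (N, B)    per-point soft assignment over the bases
                            (rows sum to 1, e.g. from a softmax)
    rotations:    (B, 3, 3) rotation of each basis at the target frame
    translations: (B, 3)    translation of each basis at the target frame
    """
    # Apply every basis transform to every point -> (B, N, 3)
    moved = np.einsum('bij,nj->bni', rotations, points) + translations[:, None, :]
    # Blend the per-basis results with the soft weights -> (N, 3)
    return np.einsum('nb,bni->ni', weights, moved)

# Tiny usage example: two bases, identity vs. a pure x-translation.
pts = np.zeros((4, 3))
w = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.9, 0.1]])
R = np.stack([np.eye(3), np.eye(3)])
t = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
print(blend_se3_motion(pts, w, R, t))  # resulting x-offsets: 0.0, 1.0, 0.5, 0.1
```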

πŸ“ Abstract
Monocular dynamic reconstruction is a challenging and long-standing vision problem due to the highly ill-posed nature of the task. Existing approaches are limited in that they either depend on templates, are effective only in quasi-static scenes, or fail to model 3D motion explicitly. In this work, we introduce a method capable of reconstructing generic dynamic scenes, featuring explicit, full-sequence-long 3D motion, from casually captured monocular videos. We tackle the under-constrained nature of the problem with two key insights: First, we exploit the low-dimensional structure of 3D motion by representing scene motion with a compact set of SE(3) motion bases. Each point's motion is expressed as a linear combination of these bases, facilitating soft decomposition of the scene into multiple rigidly-moving groups. Second, we utilize a comprehensive set of data-driven priors, including monocular depth maps and long-range 2D tracks, and devise a method to effectively consolidate these noisy supervisory signals, resulting in a globally consistent representation of the dynamic scene. Experiments show that our method achieves state-of-the-art performance for both long-range 3D/2D motion estimation and novel view synthesis on dynamic scenes. Project Page: https://shape-of-motion.github.io/
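The abstract's "consolidation of noisy supervisory signals" can be pictured as jointly penalizing disagreement with each prior. Below is an illustrative sketch of per-frame residuals against long-range 2D tracks and a monocular depth prior; all names are hypothetical, and the paper's actual loss terms, weighting, and scale handling differ.

```python
import numpy as np

def prior_consolidation_losses(pred_xy, pred_depth, track_xy, track_vis, mono_depth):
    """Toy per-frame residuals against two noisy data-driven priors.

    pred_xy:    (N, 2) projections of the tracked 3D points at this frame
    pred_depth: (N,)   depths of those points in this frame's camera
    track_xy:   (N, 2) long-range 2D track positions (e.g. from an off-the-shelf tracker)
    track_vis:  (N,)   0/1 visibility of each track at this frame
    mono_depth: (N,)   monocular depth prior sampled at track_xy
    """
    vis = track_vis.sum() + 1e-8
    # 2D track consistency: projected points should follow the tracks.
    track_loss = (track_vis * np.linalg.norm(pred_xy - track_xy, axis=-1)).sum() / vis
    # Depth consistency: monocular depth is typically only reliable up to a
    # per-frame affine ambiguity, so robust or affine-invariant terms are
    # common; a plain L1 is shown here for brevity.
    depth_loss = (track_vis * np.abs(pred_depth - mono_depth)).sum() / vis
    return track_loss, depth_loss
```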
Problem

Research questions and friction points this paper is trying to address.

Reconstructing dynamic scenes from monocular videos
Modeling explicit, globally consistent 3D motion trajectories
Overcoming the limitations of template-dependent and quasi-static-only approaches
Innovation

Methods, ideas, or system contributions that make the work stand out.

Represents scene motion with a compact set of SE(3) bases, yielding a soft decomposition into rigidly moving groups (see the sketch after this list)
Consolidates monocular depth and long-range 2D tracking priors
Enables explicit, persistent 3D motion trajectories across the full sequence
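As referenced in the list above, here is a minimal sketch of how soft per-point weights over motion bases can double as a soft rigid-part segmentation. The softmax parameterization and temperature are assumptions for illustration, not the paper's stated design.

```python
import numpy as np

def soft_group_weights(logits, temperature=1.0):
    """Soft assignment of N points over B rigid motion bases.

    logits: (N, B) learned per-point scores (hypothetical parameterization)
    """
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    w = np.exp(z)
    return w / w.sum(axis=1, keepdims=True)

def hard_segmentation(weights):
    """Collapse the soft weights into one rigid-group label per point."""
    return weights.argmax(axis=1)
```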