🤖 AI Summary
This work addresses the challenge of recovering high-quality, temporally coherent 4D dynamic objects from a single monocular video, a task hindered by scarce training data and by the ambiguity of inferring geometry and motion from one viewpoint. The authors propose Motion 3-to-4, a feed-forward framework that decomposes 4D synthesis into static 3D shape generation and motion reconstruction and can optionally take a 3D reference mesh as input. Using a canonical reference mesh, the model learns a compact latent motion representation and predicts per-frame vertex trajectories, yielding complete geometry that stays consistent over time. A scalable frame-wise Transformer makes the method robust to varying sequence lengths. Evaluated on standard benchmarks and on a new dataset with accurate ground-truth geometry, the approach outperforms prior work in fidelity and spatial consistency.
📝 Abstract
We present Motion 3-to-4, a feed-forward framework for synthesising high-quality 4D dynamic objects from a single monocular video and an optional 3D reference mesh. While recent advances have significantly improved 2D, video, and 3D content generation, 4D synthesis remains difficult due to limited training data and the inherent ambiguity of recovering geometry and motion from a monocular viewpoint. Motion 3-to-4 addresses these challenges by decomposing 4D synthesis into static 3D shape generation and motion reconstruction. Using a canonical reference mesh, our model learns a compact latent motion representation and predicts per-frame vertex trajectories to recover complete, temporally coherent geometry. A scalable frame-wise transformer further enables robustness to varying sequence lengths. Evaluations on both standard benchmarks and a new dataset with accurate ground-truth geometry show that Motion 3-to-4 delivers superior fidelity and spatial consistency compared to prior work. The project page is available at https://motion3-to-4.github.io/.
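To make the idea of per-frame vertex trajectories on a canonical mesh concrete, here is a minimal sketch in NumPy. It is not the authors' implementation: the function name `animate_mesh`, the choice to represent motion as additive per-vertex offsets, and the toy triangle are all assumptions made for illustration, and the offsets are hand-written rather than predicted by a model. What it shows is why a shared canonical mesh helps: every frame reuses the same vertices and faces, so correspondence across time comes for free.

```python
import numpy as np


def animate_mesh(canonical_vertices, per_frame_offsets):
    """Return vertex positions of shape (T, N, 3).

    canonical_vertices: (N, 3) vertices of the canonical reference mesh.
    per_frame_offsets:  (T, N, 3) displacement of each vertex in each frame.
    Representing motion as additive offsets is an assumption of this sketch.
    """
    canonical_vertices = np.asarray(canonical_vertices, dtype=float)
    per_frame_offsets = np.asarray(per_frame_offsets, dtype=float)
    if canonical_vertices.ndim != 2 or canonical_vertices.shape[1] != 3:
        raise ValueError("canonical_vertices must have shape (N, 3)")
    if per_frame_offsets.ndim != 3 or per_frame_offsets.shape[1:] != canonical_vertices.shape:
        raise ValueError("per_frame_offsets must have shape (T, N, 3)")
    # Broadcast the canonical mesh over the time axis and add the motion.
    return canonical_vertices[None, :, :] + per_frame_offsets


# Toy canonical mesh: a single triangle. The face list is shared by all frames.
canonical_vertices = np.array([[0.0, 0.0, 0.0],
                               [1.0, 0.0, 0.0],
                               [0.0, 1.0, 0.0]])
faces = np.array([[0, 1, 2]])

# Hypothetical motion: the whole triangle rises 0.1 units along z per frame.
num_frames = 4
offsets = np.zeros((num_frames, len(canonical_vertices), 3))
offsets[:, :, 2] = 0.1 * np.arange(num_frames)[:, None]

trajectories = animate_mesh(canonical_vertices, offsets)
print(trajectories.shape)
```

Because each frame's offsets are applied independently to the same canonical vertices, the number of frames `T` is free to vary in this sketch, which loosely mirrors the frame-wise design described in the abstract.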