🤖 AI Summary
This paper addresses real-time deformable 3D Gaussian reconstruction from monocular video of dynamic scenes, overcoming the limitations of existing methods that assume static scenes or rigid motion. The authors propose the first feed-forward monocular dynamic 3D Gaussian splatting framework, built on three contributions: (1) a large-scale synthetic dataset with dense 3D scene flow supervision; (2) a per-pixel parameterized deformable 3D Gaussian representation that enables long-range 3D tracking; and (3) a large transformer architecture that achieves real-time, generalizable reconstruction. By jointly leveraging multi-view consistency supervision and scene-flow-guided training, the method achieves high-fidelity, physically plausible deformation modeling. Experiments show reconstruction accuracy on par with optimization-based approaches and substantial gains over prior feed-forward dynamic reconstruction methods, while the predicted 3D deformations support monocular 3D tracking on par with state-of-the-art methods.
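To make the per-pixel deformable Gaussian representation concrete, here is a minimal PyTorch sketch of how such a parameterization could be laid out. Every name, shape, and field below is an illustrative assumption, not the paper's actual interface: we assume one Gaussian per input pixel, a canonical center per Gaussian, and an explicit 3D deformation vector per target timestep.

```python
import torch
from dataclasses import dataclass

@dataclass
class DeformableGaussians:
    """Hypothetical per-pixel deformable 3D Gaussian parameters.

    Shapes assume N = H * W Gaussians (one per input pixel) and T
    target timesteps; all fields are illustrative, not the paper's API.
    """
    means: torch.Tensor      # (N, 3) canonical 3D centers
    scales: torch.Tensor     # (N, 3) per-axis scales
    rotations: torch.Tensor  # (N, 4) unit quaternions
    opacities: torch.Tensor  # (N, 1) opacities
    colors: torch.Tensor     # (N, 3) RGB (or SH coefficients)
    flows: torch.Tensor      # (T, N, 3) per-timestep 3D deformation vectors

    def at_time(self, t: int) -> torch.Tensor:
        """Deformed Gaussian centers at timestep t: canonical mean plus flow."""
        return self.means + self.flows[t]
```

Keeping the deformation as explicit per-timestep 3D offsets is what would let one prediction serve both dynamic view synthesis (splat the Gaussians deformed to time t) and tracking (follow a single Gaussian's center across timesteps).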
📝 Abstract
We introduce the Deformable Gaussian Splats Large Reconstruction Model (DGS-LRM), the first feed-forward method that predicts deformable 3D Gaussian splats from a posed monocular video of any dynamic scene. Feed-forward scene reconstruction has gained significant attention for its ability to rapidly create digital replicas of real-world environments. However, most existing models are limited to static scenes and fail to reconstruct the motion of moving objects. Developing a feed-forward model for dynamic scene reconstruction poses significant challenges, including the scarcity of training data and the need for appropriate 3D representations and training paradigms. To address these challenges, we introduce several key technical contributions: an enhanced large-scale synthetic dataset with ground-truth multi-view videos and dense 3D scene flow supervision; a per-pixel deformable 3D Gaussian representation that is easy to learn, supports high-quality dynamic view synthesis, and enables long-range 3D tracking; and a large transformer network that achieves real-time, generalizable dynamic scene reconstruction. Extensive qualitative and quantitative experiments demonstrate that DGS-LRM achieves dynamic scene reconstruction quality comparable to optimization-based methods, while significantly outperforming the state-of-the-art predictive dynamic reconstruction method on real-world examples. Its predicted, physically grounded 3D deformations are accurate and can be readily adapted to long-range 3D tracking, achieving performance on par with state-of-the-art monocular video 3D tracking methods.
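Because each Gaussian carries an explicit deformation vector per timestep, long-range 3D tracks can be read off directly from the predicted deformations. Below is a minimal, self-contained sketch of that readout, under the same assumed shapes as above; the function name and signature are hypothetical, not part of DGS-LRM's released interface.

```python
import torch

def gaussian_tracks(means: torch.Tensor, flows: torch.Tensor) -> torch.Tensor:
    """Turn per-timestep 3D deformations into long-range trajectories.

    means: (N, 3) canonical Gaussian centers, one per input pixel
    flows: (T, N, 3) predicted 3D deformation at each of T timesteps
    returns: (T, N, 3) position of every Gaussian at every timestep
    """
    # Broadcast the canonical centers across time and add the offsets.
    return means.unsqueeze(0) + flows

# The 3D track of the Gaussian tied to pixel index i is gaussian_tracks(...)[:, i].
```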