🤖 AI Summary
Reconstructing novel views of dynamic scenes from monocular videos captured by static or slowly moving cameras remains challenging. To address this, we propose the first 3D-aware dynamic Gaussian splatting method integrated with single-image depth priors. Our approach introduces three key innovations: (1) a dynamic initialization strategy that leverages single-frame depth estimation as geometric prior to guide Gaussian parameter generation; (2) joint optimization of deformable Gaussians and an implicit deformation field; and (3) a multi-scale robust depth loss enforcing inter-frame depth consistency. Unlike prior methods, ours does not require rapid camera motion. Evaluated on casually captured videos, it achieves a 2.1 dB PSNR improvement over state-of-the-art dynamic NeRF and dynamic Gaussian splatting methods, and—critically—enables high-fidelity dynamic view synthesis under static-camera capture conditions for the first time.
📝 Abstract
In this paper, we propose MoDGS, a new pipeline to render novel views of dynamic scenes from a casually captured monocular video. Previous monocular dynamic NeRF or Gaussian Splatting methods strongly rely on the rapid movement of input cameras to construct multiview consistency but struggle to reconstruct dynamic scenes on casually captured input videos whose cameras are either static or move slowly. To address this challenging task, MoDGS adopts recent single-view depth estimation methods to guide the learning of the dynamic scene. Then, a novel 3D-aware initialization method is proposed to learn a reasonable deformation field, and a new robust depth loss is proposed to guide the learning of dynamic scene geometry. Comprehensive experiments demonstrate that MoDGS is able to render high-quality novel view images of dynamic scenes from just a casually captured monocular video, outperforming state-of-the-art methods by a significant margin. The code will be publicly available.
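The abstract does not spell out the paper's robust depth loss. Since single-view depth networks predict depth only up to an unknown scale and shift, a common way to supervise rendered depth with such priors is to first align the prior per frame with a closed-form least-squares fit and then apply a robust residual. The sketch below is an illustrative formulation of this idea, not MoDGS's actual loss; the function name and the L1 residual choice are assumptions.

```python
import numpy as np

def scale_shift_invariant_depth_loss(rendered: np.ndarray, prior: np.ndarray) -> float:
    """Illustrative robust depth loss (an assumed stand-in, not the paper's exact loss).

    Monocular depth priors are ambiguous up to scale s and shift t, so we
    solve min_{s,t} || s * prior + t - rendered ||^2 in closed form, then
    penalize the aligned residual with a robust L1 term.
    """
    d = prior.ravel()
    r = rendered.ravel()
    # Closed-form least-squares fit of (s, t): columns are [depth, 1].
    A = np.stack([d, np.ones_like(d)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, r, rcond=None)
    aligned = s * d + t
    # Robust (L1) residual between aligned prior and rendered depth.
    return float(np.mean(np.abs(aligned - r)))
```

Because the alignment is refit for each frame, this loss supervises only the relative depth structure, which is exactly what a single-image depth prior can reliably provide.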