🤖 AI Summary
Reconstructing novel views of dynamic scenes from monocular videos captured by static or slowly moving cameras remains challenging. To address this, we propose the first 3D-aware dynamic Gaussian splatting method integrated with single-image depth priors. Our approach introduces three key innovations: (1) a dynamic initialization strategy that leverages single-frame depth estimation as geometric prior to guide Gaussian parameter generation; (2) joint optimization of deformable Gaussians and an implicit deformation field; and (3) a multi-scale robust depth loss enforcing inter-frame depth consistency. Unlike prior methods, ours does not require rapid camera motion. Evaluated on casually captured videos, it achieves a 2.1 dB PSNR improvement over state-of-the-art dynamic NeRF and dynamic Gaussian splatting methods, and—critically—enables high-fidelity dynamic view synthesis under static-camera capture conditions for the first time.
📝 Abstract
In this paper, we propose MoDGS, a new pipeline to render novel views of dynamic scenes from a casually captured monocular video. Previous monocular dynamic NeRF or Gaussian Splatting methods strongly rely on the rapid movement of input cameras to construct multiview consistency but struggle to reconstruct dynamic scenes on casually captured input videos whose cameras are either static or move slowly. To address this challenging task, MoDGS adopts recent single-view depth estimation methods to guide the learning of the dynamic scene. Then, a novel 3D-aware initialization method is proposed to learn a reasonable deformation field, and a new robust depth loss is proposed to guide the learning of dynamic scene geometry. Comprehensive experiments demonstrate that MoDGS is able to render high-quality novel view images of dynamic scenes from just a casually captured monocular video, outperforming state-of-the-art methods by a significant margin. The code will be publicly available.
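The abstract does not spell out the paper's robust depth loss. Since single-view depth networks predict depth only up to an unknown scale and shift, a common way to supervise rendered depth with such priors is to first align the prior per frame with a closed-form least-squares fit and then apply a robust residual. The sketch below is an illustrative formulation of this idea, not MoDGS's actual loss; the function name and the L1 residual choice are assumptions.

```python
import numpy as np

def scale_shift_invariant_depth_loss(rendered: np.ndarray, prior: np.ndarray) -> float:
    """Illustrative robust depth loss (an assumed stand-in, not the paper's exact loss).

    Monocular depth priors are ambiguous up to scale s and shift t, so we
    solve min_{s,t} || s * prior + t - rendered ||^2 in closed form, then
    penalize the aligned residual with a robust L1 term.
    """
    d = prior.ravel()
    r = rendered.ravel()
    # Closed-form least-squares fit of (s, t): columns are [depth, 1].
    A = np.stack([d, np.ones_like(d)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, r, rcond=None)
    aligned = s * d + t
    # Robust (L1) residual between aligned prior and rendered depth.
    return float(np.mean(np.abs(aligned - r)))
```

Because the alignment is refit for each frame, this loss supervises only the relative depth structure, which is exactly what a single-image depth prior can reliably provide.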