🤖 AI Summary
Reconstructing 4D dynamic scenes from casually captured monocular video often yields incomplete results, since each timestamp is observed from only a single viewpoint and monocular depth estimates carry large errors. To address this, the paper proposes Vivid4D, a view augmentation method inspired by video inpainting that jointly leverages geometric and generative priors. Multi-view synthesis is reformulated as a spatially and temporally consistent video completion task: observed views are warped into new viewpoints using monocular depth priors, and a video inpainting model, trained on unposed web videos with synthetically generated masks that mimic warping occlusions, fills in the missing regions. To further mitigate inaccuracies in the depth priors, the method couples an iterative view augmentation strategy with a robust reconstruction loss. Unlike existing methods that rely solely on geometric priors for supervision or on generative priors that overlook geometry, this approach integrates both, improving reconstruction completeness and spatiotemporal consistency. Experiments on dynamic scene benchmarks demonstrate improved monocular 4D scene reconstruction and completion, with notable quality gains in occluded regions and along motion boundaries.
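To make the geometric half of this pipeline concrete, the sketch below shows depth-based reprojection in plain NumPy: each source pixel is unprojected with its predicted depth, transformed into a hypothetical target camera, and forward-splatted with a z-buffer. The function name, inputs, and splatting scheme are illustrative assumptions rather than the paper's implementation; the point is that the unfilled pixels in the returned mask are precisely the warping occlusions the video inpainting model is trained to complete.

```python
import numpy as np

def warp_to_new_view(image, depth, K, T_src_to_tgt):
    """Hypothetical depth-based warp of one frame into a target viewpoint.

    image:        (H, W, 3) source frame
    depth:        (H, W) monocular depth prediction for the source frame
    K:            (3, 3) camera intrinsics
    T_src_to_tgt: (4, 4) relative pose from the source to the target camera

    Returns the warped image and a validity mask; the holes in the mask
    are the warping occlusions handed to the inpainting model.
    """
    H, W = depth.shape
    # Homogeneous pixel grid (u along width, v along height).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], -1).reshape(-1, 3).astype(np.float64)

    # Back-project with the depth prior, then move into the target camera.
    cam_pts = (np.linalg.inv(K) @ pix.T) * depth.reshape(1, -1)
    cam_pts = np.vstack([cam_pts, np.ones((1, cam_pts.shape[1]))])
    tgt_pts = (T_src_to_tgt @ cam_pts)[:3]

    # Project into the target image plane.
    proj = K @ tgt_pts
    z = np.maximum(proj[2], 1e-6)
    uu = np.round(proj[0] / z).astype(int)
    vv = np.round(proj[1] / z).astype(int)
    valid = (proj[2] > 1e-6) & (uu >= 0) & (uu < W) & (vv >= 0) & (vv < H)

    # Forward-splat with a z-buffer so nearer surfaces win (a simple
    # nearest-pixel scheme; a real pipeline would vectorize this loop).
    warped = np.zeros_like(image)
    mask = np.zeros((H, W), dtype=bool)
    zbuf = np.full((H, W), np.inf)
    colors = image.reshape(-1, 3)
    for i in np.flatnonzero(valid):
        y, x = vv[i], uu[i]
        if z[i] < zbuf[y, x]:
            zbuf[y, x], warped[y, x], mask[y, x] = z[i], colors[i], True
    return warped, mask
```

Nearest-pixel splatting is the simplest choice here; softer schemes (e.g. bilinear or softmax splatting) trade crisper occlusion boundaries for fewer aliasing holes, but either way the mask of unwritten pixels is what defines the inpainting problem.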
📝 Abstract
Reconstructing 4D dynamic scenes from casually captured monocular videos is valuable but highly challenging, as each timestamp is observed from a single viewpoint. We introduce Vivid4D, a novel approach that enhances 4D monocular video synthesis by augmenting observation views: synthesizing multi-view videos from a monocular input. Unlike existing methods that either solely leverage geometric priors for supervision or use generative priors while overlooking geometry, we integrate both. This reformulates view augmentation as a video inpainting task, where observed views are warped into new viewpoints based on monocular depth priors. To achieve this, we train a video inpainting model on unposed web videos with synthetically generated masks that mimic warping occlusions, ensuring spatially and temporally consistent completion of missing regions. To further mitigate inaccuracies in monocular depth priors, we introduce an iterative view augmentation strategy and a robust reconstruction loss. Experiments demonstrate that our method effectively improves monocular 4D scene reconstruction and completion.
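The abstract leaves the form of the robust reconstruction loss unspecified. One common choice consistent with its stated purpose, keeping pixels corrupted by inaccurate depth warps from dominating the gradient, is a truncated Charbonnier penalty; the PyTorch sketch below (the function name and threshold value are assumptions, not the paper's actual loss) illustrates the idea.

```python
import torch

def robust_recon_loss(pred, target, eps=1e-3, tau=0.5):
    """Truncated Charbonnier photometric loss (an assumed form).

    pred, target: rendered colors vs. augmented-view colors.
    eps smooths the penalty near zero; tau caps the per-pixel error so
    that pixels corrupted by bad depth warps stop pulling gradient.
    """
    err = torch.sqrt((pred - target) ** 2 + eps ** 2)
    return torch.clamp(err, max=tau).mean()
```

In training, a term like this would replace a plain L2 photometric loss over the augmented views, with the truncation threshold tuned so that genuinely mismatched pixels saturate while well-warped ones still receive gradient.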