🤖 AI Summary
Reconstructing 4D dynamic scenes from casually captured monocular video often yields incomplete results, since each timestamp is observed from only a single viewpoint and monocular depth estimates carry large errors. To address this, the paper proposes Vivid4D, a view augmentation method inspired by video inpainting that jointly leverages geometric and generative priors. Multi-view synthesis is reformulated as a spatially and temporally consistent video completion task: observed views are warped into new viewpoints using monocular depth priors, and a video inpainting model, trained on unposed web videos with synthetically generated masks that mimic warping occlusions, fills in the missing regions. To further mitigate inaccuracies in the depth priors, the method couples an iterative view augmentation strategy with a robust reconstruction loss. Unlike existing methods that rely solely on geometric priors for supervision or on generative priors that overlook geometry, this approach integrates both, improving reconstruction completeness and spatiotemporal consistency. Experiments on dynamic scene benchmarks demonstrate improved monocular 4D scene reconstruction and completion, with notable quality gains in occluded regions and along motion boundaries.
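To make the geometric half of this pipeline concrete, the sketch below shows depth-based reprojection in plain NumPy: each source pixel is unprojected with its predicted depth, transformed into a hypothetical target camera, and forward-splatted with a z-buffer. The function name, inputs, and splatting scheme are illustrative assumptions rather than the paper's implementation; the point is that the unfilled pixels in the returned mask are precisely the warping occlusions the video inpainting model is trained to complete.

```python
import numpy as np

def warp_to_new_view(image, depth, K, T_src_to_tgt):
    """Hypothetical depth-based warp of one frame into a target viewpoint.

    image:        (H, W, 3) source frame
    depth:        (H, W) monocular depth prediction for the source frame
    K:            (3, 3) camera intrinsics
    T_src_to_tgt: (4, 4) relative pose from the source to the target camera

    Returns the warped image and a validity mask; the holes in the mask
    are the warping occlusions handed to the inpainting model.
    """
    H, W = depth.shape
    # Homogeneous pixel grid (u along width, v along height).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], -1).reshape(-1, 3).astype(np.float64)

    # Back-project with the depth prior, then move into the target camera.
    cam_pts = (np.linalg.inv(K) @ pix.T) * depth.reshape(1, -1)
    cam_pts = np.vstack([cam_pts, np.ones((1, cam_pts.shape[1]))])
    tgt_pts = (T_src_to_tgt @ cam_pts)[:3]

    # Project into the target image plane.
    proj = K @ tgt_pts
    z = np.maximum(proj[2], 1e-6)
    uu = np.round(proj[0] / z).astype(int)
    vv = np.round(proj[1] / z).astype(int)
    valid = (proj[2] > 1e-6) & (uu >= 0) & (uu < W) & (vv >= 0) & (vv < H)

    # Forward-splat with a z-buffer so nearer surfaces win (a simple
    # nearest-pixel scheme; a real pipeline would vectorize this loop).
    warped = np.zeros_like(image)
    mask = np.zeros((H, W), dtype=bool)
    zbuf = np.full((H, W), np.inf)
    colors = image.reshape(-1, 3)
    for i in np.flatnonzero(valid):
        y, x = vv[i], uu[i]
        if z[i] < zbuf[y, x]:
            zbuf[y, x], warped[y, x], mask[y, x] = z[i], colors[i], True
    return warped, mask
```

Nearest-pixel splatting is the simplest choice here; softer schemes (e.g. bilinear or softmax splatting) trade crisper occlusion boundaries for fewer aliasing holes, but either way the mask of unwritten pixels is what defines the inpainting problem.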
📝 Abstract
Reconstructing 4D dynamic scenes from casually captured monocular videos is valuable but highly challenging, as each timestamp is observed from a single viewpoint. We introduce Vivid4D, a novel approach that enhances 4D monocular video synthesis by augmenting observation views: synthesizing multi-view videos from a monocular input. Unlike existing methods that either solely leverage geometric priors for supervision or use generative priors while overlooking geometry, we integrate both. This reformulates view augmentation as a video inpainting task, where observed views are warped into new viewpoints based on monocular depth priors. To achieve this, we train a video inpainting model on unposed web videos with synthetically generated masks that mimic warping occlusions, ensuring spatially and temporally consistent completion of missing regions. To further mitigate inaccuracies in monocular depth priors, we introduce an iterative view augmentation strategy and a robust reconstruction loss. Experiments demonstrate that our method effectively improves monocular 4D scene reconstruction and completion.
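The abstract leaves the form of the robust reconstruction loss unspecified. One common choice consistent with its stated purpose, keeping pixels corrupted by inaccurate depth warps from dominating the gradient, is a truncated Charbonnier penalty; the PyTorch sketch below (the function name and threshold value are assumptions, not the paper's actual loss) illustrates the idea.

```python
import torch

def robust_recon_loss(pred, target, eps=1e-3, tau=0.5):
    """Truncated Charbonnier photometric loss (an assumed form).

    pred, target: rendered colors vs. augmented-view colors.
    eps smooths the penalty near zero; tau caps the per-pixel error so
    that pixels corrupted by bad depth warps stop pulling gradient.
    """
    err = torch.sqrt((pred - target) ** 2 + eps ** 2)
    return torch.clamp(err, max=tau).mean()
```

In training, a term like this would replace a plain L2 photometric loss over the augmented views, with the truncation threshold tuned so that genuinely mismatched pixels saturate while well-warped ones still receive gradient.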