AI Summary
Existing video inpainting methods rely on manually annotated binary masks to delineate damaged regions, resulting in low efficiency and limited practicality. This paper proposes the first blind video inpainting framework that eliminates the need for explicit binary masks, enabling end-to-end joint learning of damage localization ("where") and content reconstruction ("how"). Our method introduces three key innovations: (1) a mask prediction module guided by semantic discontinuity detection and temporal consistency priors; (2) a mutual constraint loss between mask prediction and inpainting outputs to enforce structural coherence; and (3) a multi-scale spatiotemporal feature fusion architecture, enhanced by self-supervised regularization and hybrid training on both synthetically and naturally corrupted videos. Evaluated on a newly constructed multi-source benchmark, our approach achieves a 23.6% improvement in mask prediction accuracy and sets new state-of-the-art PSNR and SSIM scores, while supporting fully end-to-end, annotation-free deployment.
Abstract
Video inpainting aims to fill in corrupted regions of a video with plausible content. Existing methods generally assume that the locations of the corrupted regions are known, and focus primarily on "how to inpaint". This assumption requires manual annotation of the corrupted regions with binary masks that indicate "where to inpaint". However, annotating these masks is labor-intensive and expensive, which limits the practicality of current methods. In this paper, we relax this assumption by defining a new blind video inpainting setting, in which the network learns the mapping from corrupted video to inpainted result directly, eliminating the need for corrupted-region annotations. Specifically, we propose an end-to-end blind video inpainting network (BVINet) that addresses both "where to inpaint" and "how to inpaint" simultaneously. On the one hand, BVINet predicts the masks of corrupted regions by detecting semantically discontinuous regions of each frame and exploiting the temporal consistency prior of the video. On the other hand, the predicted masks are fed back into BVINet, allowing it to capture valid context information from uncorrupted regions to fill in the corrupted ones. In addition, we introduce a consistency loss to regularize the training parameters of BVINet. In this way, mask prediction and video completion mutually constrain each other, thereby maximizing the overall performance of the trained model. Furthermore, we construct a dataset consisting of synthetic corrupted videos, real-world corrupted videos, and their corresponding completed videos. This dataset serves as a valuable resource for advancing blind video inpainting research. Extensive experimental results demonstrate the effectiveness and superiority of our method.
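To make the interplay between "where" and "how" concrete, here is a minimal NumPy sketch of how a predicted soft mask could gate a completion output and how a consistency-style penalty could tie the two together. It is an illustration only, not BVINet's implementation: the abstract does not specify the form of the consistency loss, so the function names, tensor shapes, and the loss below are all assumptions.

```python
import numpy as np

def composite(corrupted, generated, mask):
    """Blend generated content into predicted-corrupted regions.

    corrupted, generated: float arrays of shape (T, H, W, C).
    mask: float array of shape (T, H, W, 1) in [0, 1], where 1 means
    "predicted corrupted" (assumed convention).
    """
    return mask * generated + (1.0 - mask) * corrupted

def consistency_penalty(corrupted, generated, mask, eps=1e-8):
    """Hypothetical consistency term (not taken from the paper).

    Penalizes the completion branch for altering pixels that the mask
    branch marks as clean, using a mean absolute difference over the
    predicted-clean region.
    """
    valid = 1.0 - mask
    diff = np.abs(generated - corrupted) * valid
    channels = corrupted.shape[-1]
    return diff.sum() / (valid.sum() * channels + eps)

# Toy clip: 2 frames of 4x4 RGB, top half of every frame marked corrupted.
corrupted = np.zeros((2, 4, 4, 3))
generated = np.ones((2, 4, 4, 3))
mask = np.zeros((2, 4, 4, 1))
mask[:, :2] = 1.0

output = composite(corrupted, generated, mask)
penalty = consistency_penalty(corrupted, generated, mask)
```

In this toy setup the composite keeps the original pixels wherever the mask is 0, and the penalty is large because the generated frames differ from the input everywhere in the predicted-clean region. In a trained model both the mask and the generated frames would be network outputs, so a term of this kind would let gradients flow into both branches.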