BVINet: Unlocking Blind Video Inpainting with Zero Annotations

πŸ“… 2025-02-03
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing video inpainting methods rely on manually annotated binary masks to delineate damaged regions, resulting in low efficiency and limited practicality. This paper proposes the first blind video inpainting framework that eliminates the need for explicit binary masks, enabling end-to-end joint learning of damage localization (β€œwhere”) and content reconstruction (β€œhow”). Our method introduces three key innovations: (1) a mask prediction module guided by semantic discontinuity detection and temporal consistency priors; (2) a mutual constraint loss between mask prediction and inpainting outputs to enforce structural coherence; and (3) a multi-scale spatiotemporal feature fusion architecture, enhanced by self-supervised regularization and hybrid training on both synthetically and naturally corrupted videos. Evaluated on a newly constructed multi-source benchmark, our approach achieves a 23.6% improvement in mask prediction accuracy and sets new state-of-the-art PSNR and SSIM scores, while supporting fully end-to-end, annotation-free deployment.

πŸ“ Abstract
Video inpainting aims to fill corrupted regions of a video with plausible content. Existing methods generally assume that the locations of corrupted regions are known, focusing primarily on "how to inpaint". This reliance necessitates manual annotation of the corrupted regions with binary masks to indicate "where to inpaint". However, annotating these masks is labor-intensive and expensive, limiting the practicality of current methods. In this paper, we relax this assumption by defining a new blind video inpainting setting, enabling networks to learn the mapping from a corrupted video to its inpainted result directly, eliminating the need for corrupted-region annotations. Specifically, we propose an end-to-end blind video inpainting network (BVINet) to address both "where to inpaint" and "how to inpaint" simultaneously. On the one hand, BVINet predicts the masks of corrupted regions by detecting semantically discontinuous regions of each frame and exploiting the temporal consistency prior of the video. On the other hand, the predicted masks are fed back into BVINet, allowing it to capture valid context from uncorrupted regions to fill in corrupted ones. In addition, we introduce a consistency loss to regularize the training of BVINet. In this way, mask prediction and video completion mutually constrain each other, maximizing the overall performance of the trained model. Furthermore, we build a dataset consisting of synthetic corrupted videos, real-world corrupted videos, and their corresponding completed videos, which serves as a valuable resource for advancing blind video inpainting research. Extensive experimental results demonstrate the effectiveness and superiority of our method.
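The mutual constraint between mask prediction and video completion can be pictured as a joint loss over both outputs. The following NumPy sketch is an illustration only, not the paper's actual formulation: the three terms (reconstruction inside the predicted mask, fidelity outside it, and a sparsity prior weighted by `lam`) and all names are assumptions.

```python
import numpy as np

def consistency_loss(corrupted, pred_mask, inpainted, gt, lam=0.1):
    """Hypothetical joint loss coupling "where" and "how".

    corrupted: observed frames, shape (T, C, H, W)
    pred_mask: soft mask in [0, 1], shape (T, 1, H, W); 1 = corrupted
    inpainted: network output, same shape as corrupted
    gt:        ground-truth completed frames (available for synthetic data)
    """
    # Inside predicted corrupted regions, the output should match the ground truth.
    hole = (pred_mask * np.abs(inpainted - gt)).mean()
    # Outside them, the output should leave the observed pixels untouched.
    valid = ((1.0 - pred_mask) * np.abs(inpainted - corrupted)).mean()
    # A sparsity prior discourages the mask from covering the whole frame.
    return hole + valid + lam * pred_mask.mean()
```

Under this toy formulation, an over-large mask is penalized by the sparsity term, while an under-sized mask leaves visible corruption that the fidelity term exposes, so the two sub-tasks regularize each other.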
Problem

Research questions and friction points this paper is trying to address.

Video Restoration
Automatic Damage Detection
Blind Spot Repair
Innovation

Methods, ideas, or system contributions that make the work stand out.

BVINet
Blind Damage Detection
Simultaneous Restoration and Detection Optimization
πŸ”Ž Similar Papers
No similar papers found.
Zhiliang Wu
Research Scientist, Siemens Technology
Representation learning, Machine learning, Gaussian Processes, Healthcare
Kerui Chen
ReLER Lab, CCAI, Zhejiang University
Kun Li
ReLER Lab, CCAI, Zhejiang University
Hehe Fan
Zhejiang University
Deep learning, Computer vision, Multimedia, AI for science
Yi Yang
ReLER Lab, CCAI, Zhejiang University