Infusion: Internal Diffusion for Inpainting of Dynamic Textures and Complex Motion

📅 2023-11-02

🏛️ Computer graphics forum (Print)

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

Video inpainting faces two challenges: modeling high-dimensional spatiotemporal data and preserving temporal coherence. Existing diffusion-based approaches also carry a prohibitive computational cost. This paper proposes a lightweight diffusion framework trained in a self-supervised way on a single video: it needs no external training data and instead exploits the spatiotemporal self-similarity of the input video to build a task-specific prior. The key contributions are threefold: (1) a single-video self-supervised training paradigm for video diffusion; (2) a scheme that splits the diffusion process into learning intervals, each covering a range of noise levels, to make training and inference more efficient; and (3) a prior driven by the video's own self-similarity. With only about 0.5 million parameters, the model matches or exceeds state-of-the-art performance on inpainting of dynamic textures and complex dynamic backgrounds, delivering both high-fidelity reconstruction and strong temporal consistency.

📝 Abstract

Video inpainting is the task of filling a region in a video in a visually convincing manner. It is very challenging due to the high dimensionality of the data and the temporal consistency required for obtaining convincing results. Recently, diffusion models have shown impressive results in modeling complex data distributions, including images and videos. Such models nonetheless remain very expensive to train and to run at inference time, which strongly reduces their applicability to videos and yields unreasonable computational loads. We show that in the case of video inpainting, thanks to the highly self-similar nature of videos, the training data of a diffusion model can be restricted to the input video and still produce very satisfying results. With this internal learning approach, where the training data is limited to a single video, our lightweight models perform very well with only half a million parameters, in contrast to the very large networks with billions of parameters typically found in the literature. We also introduce a new method for efficient training and inference of diffusion models in the context of internal learning, by splitting the diffusion process into different learning intervals corresponding to different noise levels of the diffusion process. We show qualitative and quantitative results, demonstrating that our method reaches or exceeds state-of-the-art performance in the case of dynamic textures and complex dynamic backgrounds.

Problem

Research questions and friction points this paper is trying to address.

Video inpainting for dynamic textures and motion

Reducing computational load in diffusion models

Internal learning with single-video training data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Internal learning with single video data

Lightweight model with 500k parameters

Split diffusion process into intervals
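
To make the interval idea concrete, here is a minimal sketch of how a diffusion timestep range could be partitioned into contiguous noise-level intervals, so that training and inference can be organized per interval. This is an illustration only: the uniform split, the number of steps and intervals, and the helper names `split_timesteps` and `interval_index` are assumptions, not details taken from the paper.

```python
from typing import List, Tuple


def split_timesteps(num_steps: int, num_intervals: int) -> List[Tuple[int, int]]:
    """Partition timesteps 0..num_steps-1 into contiguous half-open
    intervals [start, end). Uniform splitting is an assumption made
    for illustration; the paper's actual interval boundaries may differ."""
    if not 1 <= num_intervals <= num_steps:
        raise ValueError("need 1 <= num_intervals <= num_steps")
    base, extra = divmod(num_steps, num_intervals)
    intervals = []
    start = 0
    for i in range(num_intervals):
        # Spread any remainder over the first `extra` intervals.
        end = start + base + (1 if i < extra else 0)
        intervals.append((start, end))
        start = end
    return intervals


def interval_index(t: int, intervals: List[Tuple[int, int]]) -> int:
    """Return the index of the interval that contains timestep t
    (hypothetical helper for choosing which training stage handles t)."""
    for i, (start, end) in enumerate(intervals):
        if start <= t < end:
            return i
    raise ValueError(f"timestep {t} is outside all intervals")


if __name__ == "__main__":
    # Hypothetical setting: 1000 diffusion steps, 4 noise-level intervals.
    intervals = split_timesteps(1000, 4)
    print(intervals)
    print(interval_index(999, intervals))
```

In a setup like this, low timestep indices correspond to low noise levels and high indices to high noise levels, so each interval groups timesteps with comparable noise magnitude.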
