🤖 AI Summary
This paper addresses the challenge of inaccurate interpolation time estimation in video frame interpolation (VFI) for traditional hand-drawn animation, a problem primarily caused by abrupt motion changes. To this end, we propose a time-adaptive lightweight residual diffusion framework. Methodologically, it explicitly models and adaptively learns the optimal interpolation time—the first such approach in VFI—and adapts the ResShift diffusion paradigm, enabling high-fidelity intermediate frame synthesis in only ~10 denoising steps. Furthermore, it incorporates optical flow-guided feature alignment and pixel-wise uncertainty modeling to yield interpretable confidence maps. Extensive experiments on animation benchmarks demonstrate significant improvements over state-of-the-art methods, achieving an effective balance among high-fidelity interpolation quality, rapid inference (~10 steps), and physically grounded interpretability.
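As background on the residual diffusion paradigm mentioned above, the following is a minimal sketch of a ResShift-style forward (noising) process as formulated in the super-resolution literature: at diffusion time t, the marginal shifts the clean target x0 toward a degraded initial estimate y by a fraction of the residual, plus scaled Gaussian noise. The shift schedule `eta_t` and noise scale `kappa` are hyperparameters; how y is constructed for VFI (e.g., from the two input frames) is not specified in this summary, so the code below is an illustrative assumption, not the paper's exact model.

```python
import numpy as np

def resshift_forward(x0, y, eta_t, kappa, rng):
    """Sample x_t ~ N(x0 + eta_t * (y - x0), kappa^2 * eta_t * I).

    x0    : clean target frame (e.g., the ground-truth middle frame)
    y     : initial/degraded estimate the residual is shifted toward
    eta_t : shift schedule value in [0, 1] (0 at t=0, ~1 at t=T)
    kappa : noise scale hyperparameter
    """
    residual = y - x0
    noise = rng.standard_normal(x0.shape)
    return x0 + eta_t * residual + kappa * np.sqrt(eta_t) * noise
```

With `eta_t` near 0 the sample stays close to the clean frame x0; with `eta_t` near 1 it is centered on the initial estimate y, so the reverse process only has to traverse the residual between the two, which is what permits the very small number of denoising steps.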
📝 Abstract
In this work, we propose a new diffusion-based method for video frame interpolation (VFI) in the context of traditional hand-drawn animation. We introduce three main contributions. First, we explicitly model the interpolation time, which we also re-estimate during training, to cope with the particularly large variations observed in the animation domain compared to natural videos. Second, we adapt and generalize ResShift, a diffusion scheme recently proposed in the super-resolution community, to VFI, which allows us to produce our estimates with a very low number of diffusion steps (on the order of 10). Third, we leverage the stochastic nature of the diffusion process to provide a pixel-wise estimate of the uncertainty of the interpolated frame, which can help anticipate where the model may be wrong. We provide extensive comparisons with state-of-the-art models and show that our model outperforms them on animation videos.
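The third contribution, pixel-wise uncertainty from the stochastic sampler, can be sketched as follows: draw several interpolated frames for the same input pair (each run uses fresh noise) and take the per-pixel standard deviation across samples as a confidence map. The function name and interface below are hypothetical; the paper's exact aggregation is not given in this abstract.

```python
import numpy as np

def uncertainty_map(sample_fn, num_samples=8):
    """Estimate pixel-wise uncertainty from a stochastic diffusion sampler.

    sample_fn   : callable returning one interpolated frame (H x W, or
                  H x W x C) per call, with independent noise each time
    num_samples : number of stochastic draws to aggregate

    Returns the per-pixel mean frame and the per-pixel standard deviation;
    a high standard deviation flags regions where the model is uncertain.
    """
    samples = np.stack([sample_fn() for _ in range(num_samples)])
    mean_frame = samples.mean(axis=0)
    std_map = samples.std(axis=0)
    return mean_frame, std_map
```

In practice the standard-deviation map can be visualized as a heatmap over the interpolated frame, so an artist can see where the synthesized in-between is likely unreliable before inspecting it.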