Generative Inbetweening: Adapting Image-to-Video Models for Keyframe Interpolation

📅 2024-08-27

🏛️ arXiv.org

📈 Citations: 7

✨ Influential: 3


🤖 AI Summary

To address motion discontinuity and poor temporal consistency in keyframe-based video interpolation, this paper proposes a dual-directional diffusion sampling framework. Rather than training a new large-scale model, it applies lightweight fine-tuning to a pretrained image-to-video diffusion model, which normally generates video moving forward in time from a single image, to obtain a second version that generates video moving backward in time. Sampling then starts from both keyframes at once: the forward model is conditioned on the first keyframe, the backward model on the last, and their overlapping estimates of the intermediate frames are combined so that the result is consistent with both ends. According to the authors' experiments, the method outperforms both traditional frame interpolation techniques and existing diffusion-based methods.


📝 Abstract

We present a method for generating video sequences with coherent motion between a pair of input key frames. We adapt a pretrained large-scale image-to-video diffusion model (originally trained to generate videos moving forward in time from a single input image) for key frame interpolation, i.e., to produce a video in between two input frames. We accomplish this adaptation through a lightweight fine-tuning technique that produces a version of the model that instead predicts videos moving backwards in time from a single input image. This model (along with the original forward-moving model) is subsequently used in a dual-directional diffusion sampling process that combines the overlapping model estimates starting from each of the two keyframes. Our experiments show that our method outperforms both existing diffusion-based methods and traditional frame interpolation techniques.
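The dual-directional sampling idea in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the equal-weight average, the simplified update rule, and the `toy_model` stand-in are all hypothetical. A real system would call a pretrained image-to-video denoiser and its fine-tuned backward-in-time counterpart, with a proper diffusion scheduler in place of the relaxation step used here.

```python
import numpy as np


def fuse_bidirectional_estimates(x_t, first_key, last_key,
                                 forward_model, backward_model, t):
    """Combine forward and backward estimates for one sampling step.

    x_t has shape (num_frames, ...) and is stored in forward time order.
    Equal-weight averaging is an assumption made for this sketch.
    """
    # Forward model: conditioned on the first keyframe, natural frame order.
    est_fwd = forward_model(x_t, first_key, t)
    # Backward model: conditioned on the last keyframe, so it is shown the
    # frames in reversed order and its output is flipped back afterwards.
    est_bwd = backward_model(x_t[::-1], last_key, t)[::-1]
    return 0.5 * (est_fwd + est_bwd)


def sample_inbetween(first_key, last_key, num_frames,
                     forward_model, backward_model,
                     num_steps=50, step_size=0.5, seed=0):
    """Toy sampling loop: start from noise, move toward the fused estimate.

    The update is a plain relaxation step (hypothetical), not a real
    diffusion scheduler.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((num_frames,) + first_key.shape)
    for t in range(num_steps, 0, -1):
        est = fuse_bidirectional_estimates(
            x, first_key, last_key, forward_model, backward_model, t)
        x = x + step_size * (est - x)
    return x


def toy_model(frames, key, t):
    """Hypothetical stand-in for a denoiser.

    It only pins frame 0 (in the order it receives) to its conditioning
    image and leaves every other frame unchanged.
    """
    out = frames.copy()
    out[0] = key
    return out


if __name__ == "__main__":
    first = np.zeros((2, 2))
    last = np.ones((2, 2))
    video = sample_inbetween(first, last, 5, toy_model, toy_model)
    print(video.shape)
```

With the toy stand-in, only the two end frames are pulled toward their keyframes and the middle frames stay as noise. The point of the sketch is the mechanics: the backward branch sees the sequence time-reversed, and its estimate is flipped back before the two are combined.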

Problem

Research questions and friction points this paper is trying to address.

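Generating video sequences with coherent motion between a pair of keyframes

Adapting an image-to-video model that conditions on a single image to the two-keyframe setting

Reconciling forward-in-time and backward-in-time generations so that they agree on the in-between frames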

Innovation

Methods, ideas, or system contributions that make the work stand out.

Keyframe interpolation technique

Dual-directional diffusion sampling

Lightweight fine-tuning adaptation

🔎 Similar Papers

Generalizable Implicit Motion Modeling for Video Frame Interpolation

2024-07-11 | Neural Information Processing Systems | Citations: 0
