🤖 AI Summary
Existing diffusion-based video frame interpolation methods suffer from low accuracy and slow inference due to excessively large denoising ranges in latent space. To address this, we propose Hierarchical Optical Flow Diffusion Modeling (HLFM), the first framework to explicitly formulate bilateral optical flow as a hierarchical diffusion process, thereby decoupling motion estimation from content synthesis. We further design a flow-guided image synthesizer that enables end-to-end generation of high-fidelity intermediate frames. By hierarchically constraining the denoising search space via optical flow priors, HLFM achieves superior modeling precision without sacrificing computational efficiency. On multiple standard benchmarks, HLFM establishes new state-of-the-art performance in frame interpolation. Notably, it accelerates inference by over 10× compared to existing diffusion-based approaches while preserving strong temporal consistency and visual fidelity.
📝 Abstract
Most recent diffusion-based methods for video frame interpolation still lag far behind non-diffusion methods in both accuracy and efficiency. Most of them formulate the problem directly as a denoising procedure in latent space, which is less effective because of the large latent search space. We propose to model bilateral optical flow explicitly with hierarchical diffusion models, which have a much smaller search space in the denoising procedure. Based on the flow diffusion model, we then use a flow-guided image synthesizer to produce the final result. We train the flow diffusion model and the image synthesizer end to end. Our method achieves state-of-the-art accuracy and is more than 10x faster than other diffusion-based methods. The project page is at: https://hfd-interpolation.github.io.
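As a rough illustration of the two-stage design described above, the sketch below mimics the pipeline with toy NumPy stand-ins: a coarse-to-fine loop that "denoises" bilateral flow fields at increasing resolutions, each level conditioned on the upsampled estimate from the coarser level, followed by a warp-and-blend synthesizer. All function names and the simplified denoising step are hypothetical placeholders for the paper's learned networks, not its actual implementation.

```python
import numpy as np

def backward_warp(frame, flow):
    """Sample `frame` at positions displaced by `flow` (nearest-neighbour).

    frame: (H, W, C) image; flow: (H, W, 2) per-pixel (dy, dx) displacement.
    """
    h, w = frame.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_y = np.clip(np.round(ys + flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs + flow[..., 1]).astype(int), 0, w - 1)
    return frame[src_y, src_x]

def toy_denoise_step(noisy_flow, cond_flow, alpha=0.5):
    # Stand-in for one learned denoising step: pull the noisy sample
    # toward the conditioning (coarser-level) flow estimate.
    return alpha * noisy_flow + (1.0 - alpha) * cond_flow

def hierarchical_flow_diffusion(h, w, levels=3, steps=4, rng=None):
    """Toy coarse-to-fine bilateral flow estimation.

    Each level denoises a flow field at double the previous resolution,
    conditioned on the (upsampled, rescaled) coarser-level estimate, so the
    denoising search space per level stays small. Returns two flow fields
    (toward each input frame); with a zero prior they converge near zero.
    """
    rng = rng or np.random.default_rng(0)
    flows = []
    for _ in range(2):  # bilateral: one flow per input frame
        est = np.zeros((h // 2 ** (levels - 1), w // 2 ** (levels - 1), 2))
        for lvl in range(levels):
            if lvl > 0:
                # nearest-neighbour upsample; flow magnitudes double with size
                est = est.repeat(2, axis=0).repeat(2, axis=1) * 2.0
            sample = est + rng.normal(scale=1.0, size=est.shape)  # noisy init
            for _ in range(steps):
                sample = toy_denoise_step(sample, est)
            est = sample
        flows.append(est)
    return flows

def synthesize_middle_frame(frame0, frame1, flow_t0, flow_t1):
    # Flow-guided synthesis: warp both inputs toward time t and blend.
    w0 = backward_warp(frame0, flow_t0)
    w1 = backward_warp(frame1, flow_t1)
    return 0.5 * w0 + 0.5 * w1

# Usage on two constant 8x8 frames: the interpolated frame is their blend.
h, w = 8, 8
f0 = np.full((h, w, 3), 0.2)
f1 = np.full((h, w, 3), 0.8)
flow_t0, flow_t1 = hierarchical_flow_diffusion(h, w)
mid = synthesize_middle_frame(f0, f1, flow_t0, flow_t1)
```

The point of the hierarchy is visible even in this toy: each resolution level only has to correct a residual around the coarser estimate, rather than searching the full flow space at once.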