🤖 AI Summary
Existing video frame interpolation methods produce blurry results because the motion trajectory between two input frames is uncertain (e.g., acceleration, deceleration, curved or straight paths), so networks average over many plausible solutions. The blur is most severe near the temporal midpoint and in long-range motion, where directional ambiguity adds further distortion. To address this, we propose a novel frame interpolation paradigm based on **distance indexing**, which replaces conventional time indexing with an explicit, pixel-wise indication of how far each object has traveled, and decomposes long-range motion into iterative short-range predictions via reference-based estimation. A uniform distance map matches the format of a time index, so arbitrary-time interpolation with these plug-and-play strategies incurs no extra computational overhead; when additional latency is acceptable, a multi-frame continuous distance estimator computes dense pixel-wise distance maps that further disambiguate complex motion. Extensive evaluation demonstrates consistent quantitative gains (PSNR, SSIM, LPIPS) and markedly better perceptual quality than state-of-the-art methods. The framework also supports object-level re-timing and other controllable video editing tasks.
📝 Abstract
Existing video frame interpolation (VFI) methods blindly predict where each object is at a specific timestep t ("time indexing"), which struggles to predict precise object movements. Given two images of a baseball, there are infinitely many possible trajectories: accelerating or decelerating, straight or curved. This often results in blurry frames as the method averages out these possibilities. Instead of forcing the network to learn this complicated time-to-location mapping implicitly, we provide the network with an explicit hint on how far the object has traveled between the start and end frames, a novel approach termed "distance indexing". This method offers a clearer learning goal for models, reducing the uncertainty tied to object speeds. Moreover, even with this extra guidance, objects can still be blurry, especially when they are equally far from both input frames, due to the directional ambiguity in long-range motion. To solve this, we propose an iterative reference-based estimation strategy that breaks a long-range prediction down into several short-range steps. When our plug-and-play strategies are integrated into state-of-the-art learning-based models, they exhibit markedly superior perceptual quality in arbitrary-time interpolation, using a uniform distance indexing map in the same format as time indexing and requiring no extra computation. Furthermore, we demonstrate that if additional latency is acceptable, a continuous map estimator can be employed to compute pixel-wise dense distance indexing from multiple nearby frames. Combined with efficient multi-frame refinement, this extension further disambiguates complex motion, enhancing performance both qualitatively and quantitatively. Additionally, the ability to manually specify distance indexing allows independent temporal manipulation of each object, providing a novel tool for video editing tasks such as re-timing.
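To make the two ideas concrete, here is a minimal sketch of how a distance-indexed interpolator might be driven iteratively. The function names (`iterative_interpolate`, `toy_model`) and the step-splitting scheme are illustrative assumptions, not the paper's implementation: the model is assumed to take two frames plus a per-pixel distance map in [0, 1] (in place of a scalar time index), and the long-range prediction is re-anchored on each intermediate estimate so every call only covers a short range.

```python
import numpy as np

def iterative_interpolate(frame_a, frame_b, target_dist, model, steps=4):
    """Break a long-range prediction at distance `target_dist` (0..1)
    into `steps` short-range calls. Each step uses the latest estimate
    as the new reference frame, so the model only ever predicts a small
    residual motion toward frame_b (hypothetical scheme for illustration).
    """
    ref = frame_a
    traveled = 0.0
    for k in range(1, steps + 1):
        d = target_dist * k / steps  # cumulative distance after this step
        # Re-express the remaining motion relative to the current reference:
        # the fraction of the (ref -> frame_b) gap we still need to cover.
        rel = (d - traveled) / (1.0 - traveled)
        dist_map = np.full_like(frame_a, rel)  # uniform distance-index map
        ref = model(ref, frame_b, dist_map)
        traveled = d
    return ref

def toy_model(a, b, dist_map):
    """Stand-in for a learned VFI network: a per-pixel linear blend
    steered by the distance map (real models would warp, not blend)."""
    return (1.0 - dist_map) * a + dist_map * b

# Example: interpolate halfway between a black and a white frame.
frame_a = np.zeros((4, 4), dtype=np.float32)
frame_b = np.ones((4, 4), dtype=np.float32)
mid = iterative_interpolate(frame_a, frame_b, target_dist=0.5, model=toy_model)
```

With the linear toy model the iterative result matches a single direct call, which is exactly the consistency one would want; the benefit of splitting appears only with a real network, where each short-range step carries far less directional ambiguity than one long-range jump.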