🤖 AI Summary
Existing image-to-video diffusion models (I2V-DMs) impose asymmetric endpoint constraints in frame inbetweening: the end frame exerts far weaker control over the intermediate content than the start frame, leading to motion discontinuity and appearance collapse. To address this, the authors propose Sci-Fi, a framework that enforces symmetric start- and end-frame constraints. Its core component is EF-Net, a lightweight module that encodes only the end frame and expands it into temporally adaptive, frame-wise features injected end-to-end into the I2V-DM, strengthening the end-frame constraint to match the start-frame one and overcoming the unidirectional conditioning inherent in conventional I2V models. Evaluated on multiple benchmarks, the method substantially improves interpolation quality: generated intermediate frames show smoother motion trajectories and higher content consistency, effectively mitigating motion breaks and structural collapse.
📝 Abstract
Frame inbetweening aims to synthesize intermediate video frames conditioned on given start and end frames. Current state-of-the-art methods mainly extend large-scale pre-trained Image-to-Video Diffusion models (I2V-DMs) by incorporating an end-frame constraint, either through direct fine-tuning or without any training. We identify a critical limitation in their design: they usually inject the end-frame constraint through the same mechanism that originally imposed the start-frame (single-image) constraint. However, because the original I2V-DMs are already extensively trained on the start-frame condition, naively introducing the end-frame constraint through the same mechanism, with far less (or even no) specialized training, is unlikely to give the end frame as strong an influence over the intermediate content as the start frame. This asymmetric control strength of the two frames likely leads to inconsistent motion or appearance collapse in the generated frames. To efficiently achieve symmetric start- and end-frame constraints, we propose a novel framework, termed Sci-Fi, which applies a stronger injection mechanism to the constraint trained at the smaller scale. Specifically, it handles the start-frame constraint as before, while introducing the end-frame constraint through an improved mechanism based on a well-designed lightweight module, named EF-Net, which encodes only the end frame and expands it into temporally adaptive, frame-wise features injected into the I2V-DM. This makes the end-frame constraint as strong as the start-frame constraint, enabling Sci-Fi to produce more harmonious transitions across diverse scenarios. Extensive experiments demonstrate the superiority of Sci-Fi over other baselines.
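The abstract describes EF-Net only at a high level: it encodes the end frame alone and expands the result into temporally adaptive, frame-wise features for injection into the I2V-DM. The following PyTorch sketch illustrates one plausible realization of that idea; the layer sizes, the convolutional encoder, the learned per-frame-index embedding, and the multiplicative modulation are all assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class EFNet(nn.Module):
    """Hypothetical sketch of a lightweight end-frame encoder (EF-Net).

    Encodes only the end frame and expands it into temporally adaptive,
    frame-wise features. How these features are injected into the I2V-DM
    backbone (e.g., added to intermediate activations) is also an
    assumption and is left outside this sketch.
    """

    def __init__(self, in_channels: int = 3, feat_dim: int = 64,
                 num_frames: int = 16):
        super().__init__()
        self.num_frames = num_frames
        # Small conv encoder applied to the end frame only.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, feat_dim, 3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1),
            nn.SiLU(),
        )
        # One learned embedding per frame index makes the injected feature
        # temporally adaptive: training can learn a stronger modulation for
        # frames near the end and a weaker one near the start.
        self.temporal_embed = nn.Embedding(num_frames, feat_dim)
        self.proj = nn.Conv2d(feat_dim, feat_dim, 1)

    def forward(self, end_frame: torch.Tensor) -> torch.Tensor:
        # end_frame: (B, C, H, W) -> frame-wise features (B, T, D, H/4, W/4)
        feat = self.encoder(end_frame)                 # (B, D, h, w)
        t = self.temporal_embed.weight                 # (T, D)
        # Broadcast a per-frame modulation over the spatial feature map.
        frame_feats = feat.unsqueeze(1) * (1 + t[None, :, :, None, None])
        b, tt, d, h, w = frame_feats.shape
        frame_feats = self.proj(frame_feats.reshape(b * tt, d, h, w))
        return frame_feats.reshape(b, tt, d, h, w)
```

Under these assumptions, a 64x64 end frame yields one feature map per target frame, e.g. `EFNet(num_frames=16)(torch.randn(2, 3, 64, 64))` has shape `(2, 16, 64, 16, 16)`, ready to be injected frame-by-frame into the diffusion backbone.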