🤖 AI Summary
To address flickering artifacts, quality degradation in long sequences, and computational inefficiency when generating high-frame-rate videos with pre-trained diffusion models, this paper proposes a training-free frame interpolation framework. Methodologically, it introduces a noise re-injection mechanism and a sliding-window latent denoising strategy, combined with keyframe guidance and iterative optimization in latent space, to synthesize videos at arbitrarily high frame rates while preserving spatiotemporal consistency. The core contribution is temporally aware noise modulation paired with coordinated local-global optimization, performed without modifying the pre-trained model weights, which substantially mitigates structural distortions and luminance flickering in fast-motion scenes. Experiments demonstrate superior performance over state-of-the-art zero-shot video interpolation and generation methods on PSNR, LPIPS, and user studies. The approach combines high visual quality, computational efficiency, and strong generalization, making it suitable for applications with stringent temporal-fidelity requirements, such as VR and real-time rendering.
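To make the two core mechanisms concrete, below is a minimal PyTorch sketch of sliding-window latent denoising with a mid-run noise re-injection. Everything here is an illustrative assumption rather than the paper's implementation: the stand-in `denoise_step`, the `window`/`stride` sizes, and the toy `alpha_bar` value are all hypothetical.

```python
# Minimal sketch: sliding-window latent denoising + noise re-injection.
# All names, shapes, and schedules are assumptions for illustration,
# not the authors' actual API.
import torch

def reinject_noise(latents: torch.Tensor, alpha_bar: float) -> torch.Tensor:
    """Standard forward-diffusion re-noising: blend fresh Gaussian noise
    into partially denoised latents to push them back to a noisier step."""
    noise = torch.randn_like(latents)
    a = torch.tensor(alpha_bar)
    return a.sqrt() * latents + (1.0 - a).sqrt() * noise

def sliding_window_denoise(latents, denoise_step, t, window=16, stride=8):
    """Apply one denoising step over overlapping temporal windows and
    average the overlaps, keeping neighbouring windows consistent."""
    T = latents.shape[0]                          # (frames, C, H, W)
    out = torch.zeros_like(latents)
    count = torch.zeros(T, 1, 1, 1)
    starts = list(range(0, max(T - window, 0) + 1, stride))
    if T > window and starts[-1] + window < T:    # make sure the tail is covered
        starts.append(T - window)
    for s in starts:
        sl = slice(s, s + window)
        out[sl] += denoise_step(latents[sl], t)
        count[sl] += 1
    return out / count.clamp(min=1)

def denoise_step(x, t):
    """Placeholder for a pre-trained video diffusion denoiser."""
    return 0.95 * x

lat = torch.randn(24, 4, 8, 8)                    # toy video latents
for t in range(10):
    lat = sliding_window_denoise(lat, denoise_step, t)
    if t == 4:                                    # one illustrative re-injection
        lat = reinject_noise(lat, alpha_bar=0.9)
```

Averaging the overlapping window halves is one simple way to realize the local-global coordination the summary alludes to; the paper may use a different fusion rule.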
📝 Abstract
Recent advancements in diffusion models have revolutionized video generation, enabling the creation of high-quality, temporally consistent videos. However, generating high frame rate (FPS) videos remains a significant challenge due to issues such as flickering and degradation in long sequences, particularly in fast-motion scenarios. Existing methods often suffer from computational inefficiency and struggle to maintain video quality across extended sequences. In this paper, we present a novel, training-free approach for high-FPS video generation using pre-trained diffusion models. Our method, DiffuseSlide, introduces a new pipeline that leverages keyframes from low-FPS videos and applies innovative techniques, including noise re-injection and sliding window latent denoising, to achieve smooth, consistent video outputs without the need for additional fine-tuning. Through extensive experiments, we demonstrate that our approach significantly improves video quality, offering enhanced temporal coherence and spatial fidelity. The proposed method is not only computationally efficient but also adaptable to various video generation tasks, making it ideal for applications such as virtual reality, video games, and high-quality content creation.
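As a sketch of the keyframe-guided first stage the abstract describes, the snippet below linearly interpolates between consecutive keyframe latents to reach a target frame rate; a denoising pass like the one sketched above would then refine these rough in-betweens. The function name, tensor shapes, and `factor` parameter are hypothetical, chosen only for illustration.

```python
# Hypothetical first stage of a keyframe-guided pipeline: linear
# interpolation between consecutive keyframe latents. Shapes and the
# `factor` parameter are illustrative assumptions.
import torch

def interpolate_keyframe_latents(key_latents: torch.Tensor, factor: int) -> torch.Tensor:
    """Insert `factor - 1` linear in-betweens between each pair of
    keyframe latents, yielding coarse high-FPS latents for refinement."""
    frames = [key_latents[0]]
    for a, b in zip(key_latents[:-1], key_latents[1:]):
        for j in range(1, factor + 1):
            w = j / factor
            frames.append((1.0 - w) * a + w * b)   # w = 1 lands on keyframe b
    return torch.stack(frames)                     # ((K-1)*factor + 1, C, H, W)

keys = torch.randn(8, 4, 8, 8)                     # 8 keyframes from a low-FPS clip
hi_fps = interpolate_keyframe_latents(keys, factor=4)
print(hi_fps.shape)                                # torch.Size([29, 4, 8, 8])
```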