EDEN: Enhanced Diffusion for High-quality Large-motion Video Frame Interpolation

📅 2025-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address motion blur and temporal inconsistency caused by large displacements in highly dynamic scenes, this paper proposes an enhanced diffusion-based framework for video frame interpolation. Methodologically, it introduces three key innovations: (1) a Transformer-based latent-space tokenizer that efficiently encodes high-fidelity intermediate-frame representations; (2) a frame-difference embedding mechanism that explicitly models nonlinear motion priors from the start and end frames; and (3) a stride-aware temporal self-attention module that strengthens long-range temporal consistency. Quantitative evaluation demonstrates state-of-the-art performance: LPIPS improves (is reduced) by nearly 10% on DAVIS and SNU-FILM, while PSNR/SSIM gains reach 8% on the DAIN-HD benchmark. Qualitatively, the method significantly enhances structural fidelity and motion coherence, particularly in challenging large-motion scenarios.

📝 Abstract
Handling complex or nonlinear motion patterns has long posed challenges for video frame interpolation. Although recent advances in diffusion-based methods offer improvements over traditional optical flow-based approaches, they still struggle to generate sharp, temporally consistent frames in scenarios with large motion. To address this limitation, we introduce EDEN, an Enhanced Diffusion for high-quality large-motion vidEo frame iNterpolation. Our approach first utilizes a transformer-based tokenizer to produce refined latent representations of the intermediate frames for diffusion models. We then enhance the diffusion transformer with temporal attention across the process and incorporate a start-end frame difference embedding to guide the generation of dynamic motion. Extensive experiments demonstrate that EDEN achieves state-of-the-art results across popular benchmarks, including nearly a 10% LPIPS reduction on DAVIS and SNU-FILM, and an 8% improvement on DAIN-HD.
Problem

Research questions and friction points this paper is trying to address.

Handling complex, nonlinear motion patterns in video frame interpolation.
Producing sharp, temporally consistent frames in large-motion scenarios, where existing diffusion-based methods still struggle.
Reaching state-of-the-art quality on popular benchmarks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based tokenizer for refined latent representations
Enhanced diffusion transformer with temporal attention
Start-end frame difference embedding for dynamic motion guidance
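The innovations listed above can be sketched conceptually. The NumPy fragment below is an illustrative assumption, not the authors' implementation: the function names, tensor shapes, and the projection matrix `W` are all hypothetical. It shows the two core ideas in miniature, projecting the start-end frame difference into a motion-conditioning vector, and applying self-attention along the time axis of the intermediate-frame latents with a residual connection:

```python
import numpy as np

def frame_diff_embedding(frame0, frame1, W):
    """Project the start-end frame difference into a motion-conditioning
    vector. frame0/frame1: (H, W_img, C) images; W: (C, dim) projection."""
    diff = frame1 - frame0                        # explicit motion prior
    pooled = diff.mean(axis=(0, 1))               # global average pool -> (C,)
    return pooled @ W                             # (dim,)

def temporal_self_attention(x):
    """Scaled dot-product self-attention along the time axis with a
    residual connection. x: (T, dim) latents of intermediate frames."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                 # (T, T) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return x + weights @ x                        # residual connection

rng = np.random.default_rng(0)
frame0 = rng.standard_normal((32, 32, 3))
frame1 = rng.standard_normal((32, 32, 3))
W = rng.standard_normal((3, 16))                  # hypothetical projection

motion = frame_diff_embedding(frame0, frame1, W)  # (16,) conditioning vector
latents = rng.standard_normal((5, 16)) + motion   # condition 5 latent tokens
refined = temporal_self_attention(latents)        # (5, 16) refined latents
```

In the paper's actual pipeline these operations would act on learned latent tokens inside a diffusion transformer, with learned projections rather than random matrices; the sketch only conveys how the difference embedding conditions the latents and how attention mixes information across time.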
👥 Authors
Zihao Zhang
Tianjin University
Computer vision

Haoran Chen
Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University; Shanghai Collaborative Innovation Center of Intelligent Visual Computing

Haoyu Zhao
Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University; Shanghai Collaborative Innovation Center of Intelligent Visual Computing

Guansong Lu
ByteDance
Image/video generation and editing; 3D generation; multimodal

Yanwei Fu
Fudan University
Computer vision; machine learning; multimedia

Hang Xu
Noah's Ark Lab, Huawei

Zuxuan Wu
Fudan University