EDEN: Enhanced Diffusion for High-quality Large-motion Video Frame Interpolation

📅 2025-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address motion blur and temporal inconsistency caused by large displacements in highly dynamic scenes, this paper proposes an enhanced diffusion-based framework for video frame interpolation. Methodologically, it introduces three key innovations: (1) a Transformer-based latent-space tokenizer that efficiently encodes high-fidelity intermediate-frame representations; (2) a frame-difference embedding mechanism that explicitly models nonlinear motion priors from the start and end frames; and (3) a stride-aware temporal self-attention module that strengthens long-range temporal consistency. Quantitative evaluation demonstrates state-of-the-art performance: LPIPS improves (is reduced) by nearly 10% on DAVIS and SNU-FILM, while PSNR/SSIM gains reach 8% on the DAIN-HD benchmark. Qualitatively, the method significantly enhances structural fidelity and motion coherence, particularly in challenging large-motion scenarios.

📝 Abstract
Handling complex or nonlinear motion patterns has long posed challenges for video frame interpolation. Although recent advances in diffusion-based methods offer improvements over traditional optical flow-based approaches, they still struggle to generate sharp, temporally consistent frames in scenarios with large motion. To address this limitation, we introduce EDEN, an Enhanced Diffusion for high-quality large-motion vidEo frame iNterpolation. Our approach first utilizes a transformer-based tokenizer to produce refined latent representations of the intermediate frames for diffusion models. We then enhance the diffusion transformer with temporal attention across the process and incorporate a start-end frame difference embedding to guide the generation of dynamic motion. Extensive experiments demonstrate that EDEN achieves state-of-the-art results across popular benchmarks, including nearly a 10% LPIPS reduction on DAVIS and SNU-FILM, and an 8% improvement on DAIN-HD.
Problem

Research questions and friction points this paper is trying to address.

Handling complex, nonlinear motion patterns in video frame interpolation.
Producing sharp, temporally consistent frames in large-motion scenarios, where existing diffusion-based methods still struggle.
Reaching state-of-the-art quality on popular benchmarks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based tokenizer for refined latent representations
Enhanced diffusion transformer with temporal attention
Start-end frame difference embedding for dynamic motion guidance
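The innovations listed above can be sketched conceptually. The NumPy fragment below is an illustrative assumption, not the authors' implementation: the function names, tensor shapes, and the projection matrix `W` are all hypothetical. It shows the two core ideas in miniature, projecting the start-end frame difference into a motion-conditioning vector, and applying self-attention along the time axis of the intermediate-frame latents with a residual connection:

```python
import numpy as np

def frame_diff_embedding(frame0, frame1, W):
    """Project the start-end frame difference into a motion-conditioning
    vector. frame0/frame1: (H, W_img, C) images; W: (C, dim) projection."""
    diff = frame1 - frame0                        # explicit motion prior
    pooled = diff.mean(axis=(0, 1))               # global average pool -> (C,)
    return pooled @ W                             # (dim,)

def temporal_self_attention(x):
    """Scaled dot-product self-attention along the time axis with a
    residual connection. x: (T, dim) latents of intermediate frames."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                 # (T, T) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return x + weights @ x                        # residual connection

rng = np.random.default_rng(0)
frame0 = rng.standard_normal((32, 32, 3))
frame1 = rng.standard_normal((32, 32, 3))
W = rng.standard_normal((3, 16))                  # hypothetical projection

motion = frame_diff_embedding(frame0, frame1, W)  # (16,) conditioning vector
latents = rng.standard_normal((5, 16)) + motion   # condition 5 latent tokens
refined = temporal_self_attention(latents)        # (5, 16) refined latents
```

In the paper's actual pipeline these operations would act on learned latent tokens inside a diffusion transformer, with learned projections rather than random matrices; the sketch only conveys how the difference embedding conditions the latents and how attention mixes information across time.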
👥 Authors
Zihao Zhang
Tianjin University
Computer vision

Haoran Chen
Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University; Shanghai Collaborative Innovation Center of Intelligent Visual Computing

Haoyu Zhao
Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University; Shanghai Collaborative Innovation Center of Intelligent Visual Computing

Guansong Lu
ByteDance
Image/video generation and editing; 3D generation; multimodal

Yanwei Fu
Fudan University
Computer vision; machine learning; multimedia

Hang Xu
Noah's Ark Lab, Huawei

Zuxuan Wu
Fudan University