EventDiff: A Unified and Efficient Diffusion Model Framework for Event-based Video Frame Interpolation

📅 2025-05-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the challenge of high-fidelity frame interpolation for event-camera videos under large motion, occlusion, and illumination changes. To this end, the authors propose EventDiff, the first end-to-end event-image fusion diffusion model for VFI. Unlike existing approaches that rely on explicit optical flow estimation and motion compensation, the method introduces an Event-Frame Hybrid AutoEncoder (HAE) and a lightweight Spatial-Temporal Cross Attention (STCA) module, enabling event-driven denoising interpolation directly in the latent space and eliminating hand-crafted intermediate representations. Training follows a two-stage strategy that first pretrains the HAE and then jointly optimizes it with the diffusion model. On Vimeo90K-Triplet, the model achieves state-of-the-art PSNR, outperforming the best event-based method by 1.98 dB and surpassing diffusion-based baselines by 5.72 dB, while accelerating inference by 4.24×.

📝 Abstract
Video Frame Interpolation (VFI) is a fundamental yet challenging task in computer vision, particularly under conditions involving large motion, occlusion, and lighting variation. Recent advancements in event cameras have opened up new opportunities for addressing these challenges. While existing event-based VFI methods have succeeded in recovering large and complex motions by leveraging handcrafted intermediate representations such as optical flow, these designs often compromise high-fidelity image reconstruction under subtle motion scenarios due to their reliance on explicit motion modeling. Meanwhile, diffusion models provide a promising alternative for VFI by reconstructing frames through a denoising process, eliminating the need for explicit motion estimation or warping operations. In this work, we propose EventDiff, a unified and efficient event-based diffusion model framework for VFI. EventDiff features a novel Event-Frame Hybrid AutoEncoder (HAE) equipped with a lightweight Spatial-Temporal Cross Attention (STCA) module that effectively fuses dynamic event streams with static frames. Unlike previous event-based VFI methods, EventDiff performs interpolation directly in the latent space via a denoising diffusion process, making it more robust across diverse and challenging VFI scenarios. Through a two-stage training strategy that first pretrains the HAE and then jointly optimizes it with the diffusion model, our method achieves state-of-the-art performance across multiple synthetic and real-world event VFI datasets. The proposed method outperforms existing state-of-the-art event-based VFI methods by up to 1.98 dB in PSNR on Vimeo90K-Triplet and shows superior performance in SNU-FILM tasks with multiple difficulty levels. Compared to the emerging diffusion-based VFI approach, our method achieves up to 5.72 dB PSNR gain on Vimeo90K-Triplet and 4.24× faster inference.
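The abstract's core idea, interpolating in latent space via a denoising diffusion process rather than warping with optical flow, can be illustrated with a generic DDPM reverse step. This is a hedged sketch, not the paper's implementation: the function name, signature, and noise schedule are assumptions. In EventDiff, the predicted noise would come from a network conditioned on the fused event-frame latent produced by the HAE.

```python
import numpy as np

rng = np.random.default_rng(0)

def ddpm_reverse_step(z_t, eps_pred, t, betas):
    """One generic DDPM denoising step: estimate z_{t-1} from the noisy
    latent z_t and the noise eps_pred predicted by a conditional network.

    z_t      : (..., d) noisy latent of the intermediate frame
    eps_pred : (..., d) predicted noise (from the event-conditioned denoiser)
    t        : current timestep index
    betas    : 1-D variance schedule
    """
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    # Posterior mean of z_{t-1} given z_t and the predicted noise.
    coef = betas[t] / np.sqrt(1.0 - alpha_bar[t])
    mean = (z_t - coef * eps_pred) / np.sqrt(alphas[t])
    if t > 0:
        # Add stochastic noise on all but the final step.
        mean = mean + np.sqrt(betas[t]) * rng.standard_normal(z_t.shape)
    return mean
```

Iterating this step from pure noise down to t = 0 yields the interpolated frame's latent, which the HAE decoder would then map back to pixels.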
Problem

Research questions and friction points this paper is trying to address.

Addresses challenges in Video Frame Interpolation under large motion and occlusion
Proposes a diffusion model for high-fidelity interpolation without explicit motion estimation
Improves performance and speed in event-based frame interpolation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Event-Frame Hybrid AutoEncoder for fusion
Employs Spatial-Temporal Cross Attention module
Performs interpolation via latent denoising diffusion
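The cross-attention fusion named above can be sketched generically: static-frame features act as queries that attend over event-stream features, so each frame location gathers the motion evidence most relevant to it. This is a minimal single-head sketch under stated assumptions (no learned projections, no multi-scale structure); the function name and shapes are illustrative, not the paper's actual STCA module.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(frame_feat, event_feat):
    """Frame tokens attend over event tokens (single head, no projections).

    frame_feat : (N_f, d) features from the static keyframes -> queries
    event_feat : (N_e, d) features from the event stream     -> keys/values
    Returns    : (N_f, d) event-informed frame features
    """
    d = frame_feat.shape[-1]
    scores = frame_feat @ event_feat.T / np.sqrt(d)  # (N_f, N_e) similarities
    attn = softmax(scores, axis=-1)                  # rows sum to 1
    return attn @ event_feat                         # weighted event context
```

A real STCA module would add learned query/key/value projections and let the attention span both the spatial and temporal axes of the event representation before the fused latent enters the diffusion process.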
Hanle Zheng
Department of Precision Instrument, Tsinghua University
Bio-inspired machine learning, deep learning
Xujie Han
College of Computer Science and Technology, Taiyuan University of Technology, Jinzhong, China
Zegang Peng
Center for Brain Inspired Computing Research (CBICR), Department of Precision Instrument, Tsinghua University, Beijing, China
Shangbin Zhang
The Information Science Academy of China Electronics Technology Group Corporation, Beijing, China
Guangxun Du
The Information Science Academy of China Electronics Technology Group Corporation, Beijing, China
Zhuo Zou
Fudan University | KTH Sweden
Circuits and Systems, System on Chip, Embedded Intelligence, Internet of Things, AIoT and autonomous systems
Xilin Wang
Engineering Laboratory of Power Equipment Reliability in Complicated Coastal Environments, Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
Jibin Wu
The Hong Kong Polytechnic University
Spiking Neural Networks, Neuromorphic Computing, Speech Processing, Cognitive Modelling
Hao Guo
College of Computer Science and Technology, Taiyuan University of Technology, Jinzhong, China
Lei Deng
Center for Brain Inspired Computing Research (CBICR), Department of Precision Instrument, Tsinghua University, Beijing, China