EventDiff: A Unified and Efficient Diffusion Model Framework for Event-based Video Frame Interpolation

📅 2025-05-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the challenge of high-fidelity frame interpolation for event-camera videos under large motion, occlusion, and illumination changes. To this end, the authors propose EventDiff, the first end-to-end event-image fusion diffusion model for VFI. Unlike existing approaches that rely on explicit optical flow estimation and motion compensation, the method introduces an Event-Frame Hybrid AutoEncoder (HAE) and a lightweight Spatial-Temporal Cross Attention (STCA) module, enabling event-driven denoising interpolation directly in the latent space and eliminating hand-crafted intermediate representations. Training follows a two-stage strategy that first pretrains the HAE and then jointly optimizes it with the diffusion model. On Vimeo90K-Triplet, the model achieves state-of-the-art PSNR, outperforming the best event-based method by 1.98 dB and surpassing diffusion-based baselines by 5.72 dB, while accelerating inference by 4.24×.

📝 Abstract
Video Frame Interpolation (VFI) is a fundamental yet challenging task in computer vision, particularly under conditions involving large motion, occlusion, and lighting variation. Recent advancements in event cameras have opened up new opportunities for addressing these challenges. While existing event-based VFI methods have succeeded in recovering large and complex motions by leveraging handcrafted intermediate representations such as optical flow, these designs often compromise high-fidelity image reconstruction under subtle motion scenarios due to their reliance on explicit motion modeling. Meanwhile, diffusion models provide a promising alternative for VFI by reconstructing frames through a denoising process, eliminating the need for explicit motion estimation or warping operations. In this work, we propose EventDiff, a unified and efficient event-based diffusion model framework for VFI. EventDiff features a novel Event-Frame Hybrid AutoEncoder (HAE) equipped with a lightweight Spatial-Temporal Cross Attention (STCA) module that effectively fuses dynamic event streams with static frames. Unlike previous event-based VFI methods, EventDiff performs interpolation directly in the latent space via a denoising diffusion process, making it more robust across diverse and challenging VFI scenarios. Through a two-stage training strategy that first pretrains the HAE and then jointly optimizes it with the diffusion model, our method achieves state-of-the-art performance across multiple synthetic and real-world event VFI datasets. The proposed method outperforms existing state-of-the-art event-based VFI methods by up to 1.98 dB in PSNR on Vimeo90K-Triplet and shows superior performance in SNU-FILM tasks with multiple difficulty levels. Compared to the emerging diffusion-based VFI approach, our method achieves up to 5.72 dB PSNR gain on Vimeo90K-Triplet and 4.24× faster inference.
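The abstract's core idea, interpolating in latent space via a denoising diffusion process rather than warping with optical flow, can be illustrated with a generic DDPM reverse step. This is a hedged sketch, not the paper's implementation: the function name, signature, and noise schedule are assumptions. In EventDiff, the predicted noise would come from a network conditioned on the fused event-frame latent produced by the HAE.

```python
import numpy as np

rng = np.random.default_rng(0)

def ddpm_reverse_step(z_t, eps_pred, t, betas):
    """One generic DDPM denoising step: estimate z_{t-1} from the noisy
    latent z_t and the noise eps_pred predicted by a conditional network.

    z_t      : (..., d) noisy latent of the intermediate frame
    eps_pred : (..., d) predicted noise (from the event-conditioned denoiser)
    t        : current timestep index
    betas    : 1-D variance schedule
    """
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    # Posterior mean of z_{t-1} given z_t and the predicted noise.
    coef = betas[t] / np.sqrt(1.0 - alpha_bar[t])
    mean = (z_t - coef * eps_pred) / np.sqrt(alphas[t])
    if t > 0:
        # Add stochastic noise on all but the final step.
        mean = mean + np.sqrt(betas[t]) * rng.standard_normal(z_t.shape)
    return mean
```

Iterating this step from pure noise down to t = 0 yields the interpolated frame's latent, which the HAE decoder would then map back to pixels.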
Problem

Research questions and friction points this paper is trying to address.

Addresses challenges in Video Frame Interpolation under large motion and occlusion
Proposes a diffusion model for high-fidelity interpolation without explicit motion estimation
Improves performance and speed in event-based frame interpolation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Event-Frame Hybrid AutoEncoder for fusion
Employs Spatial-Temporal Cross Attention module
Performs interpolation via latent denoising diffusion
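The cross-attention fusion named above can be sketched generically: static-frame features act as queries that attend over event-stream features, so each frame location gathers the motion evidence most relevant to it. This is a minimal single-head sketch under stated assumptions (no learned projections, no multi-scale structure); the function name and shapes are illustrative, not the paper's actual STCA module.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(frame_feat, event_feat):
    """Frame tokens attend over event tokens (single head, no projections).

    frame_feat : (N_f, d) features from the static keyframes -> queries
    event_feat : (N_e, d) features from the event stream     -> keys/values
    Returns    : (N_f, d) event-informed frame features
    """
    d = frame_feat.shape[-1]
    scores = frame_feat @ event_feat.T / np.sqrt(d)  # (N_f, N_e) similarities
    attn = softmax(scores, axis=-1)                  # rows sum to 1
    return attn @ event_feat                         # weighted event context
```

A real STCA module would add learned query/key/value projections and let the attention span both the spatial and temporal axes of the event representation before the fused latent enters the diffusion process.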
Hanle Zheng
Department of Precision Instrument, Tsinghua University
Bio-inspired machine learning, deep learning
Xujie Han
College of Computer Science and Technology, Taiyuan University of Technology, Jinzhong, China
Zegang Peng
Center for Brain Inspired Computing Research (CBICR), Department of Precision Instrument, Tsinghua University, Beijing, China
Shangbin Zhang
The Information Science Academy of China Electronics Technology Group Corporation, Beijing, China
Guangxun Du
The Information Science Academy of China Electronics Technology Group Corporation, Beijing, China
Zhuo Zou
Fudan University | KTH Sweden
Circuits and Systems, System on Chip, Embedded Intelligence, Internet of Things, AIoT and autonomous systems
Xilin Wang
Engineering Laboratory of Power Equipment Reliability in Complicated Coastal Environments, Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
Jibin Wu
The Hong Kong Polytechnic University
Spiking Neural Networks, Neuromorphic Computing, Speech Processing, Cognitive Modelling
Hao Guo
College of Computer Science and Technology, Taiyuan University of Technology, Jinzhong, China
Lei Deng
Center for Brain Inspired Computing Research (CBICR), Department of Precision Instrument, Tsinghua University, Beijing, China