🤖 AI Summary
To address the poor generalization and limited performance of event-based video frame interpolation (EVFI) caused by scarce paired training data, this paper pioneers the adaptation of video diffusion models pre-trained on internet-scale data to EVFI. It proposes an event-frame cross-modal feature alignment mechanism and an unpaired motion modeling strategy, enabling robust temporal interpolation without pixel-level event-image correspondence labels. Lightweight fine-tuning tailored to sparse event streams and keyframes yields significantly improved cross-device generalization. Extensive evaluation on multiple real-world EVFI benchmarks, including a newly constructed dataset, demonstrates consistent superiority over state-of-the-art methods. Notably, cross-camera testing yields a PSNR improvement of up to 2.1 dB, validating the effective transfer of large-scale generative priors to this data-scarce task.
📝 Abstract
Video Frame Interpolation aims to recover realistic missing frames between observed frames, generating a high-frame-rate video from a low-frame-rate one. However, without additional guidance, the large motion between frames makes this problem ill-posed. Event-based Video Frame Interpolation (EVFI) addresses this challenge by using sparse, high-temporal-resolution event measurements as motion guidance. This guidance allows EVFI methods to significantly outperform frame-only methods. However, to date, EVFI methods have relied on a limited set of paired event-frame training data, which severely restricts their performance and generalization. In this work, we overcome this limited-data challenge by adapting pre-trained video diffusion models, trained on internet-scale datasets, to EVFI. We experimentally validate our approach on real-world EVFI datasets, including a new one that we introduce. Our method outperforms existing approaches and generalizes far better across cameras.