Event-to-Video Reconstruction using Spatio-Temporal and Frequency-Enhanced Deep Neural Networks

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Event cameras produce asynchronous event streams that carry no dense intensity information, so conventional vision methods cannot be applied to them directly, and existing video reconstruction methods struggle to recover fine detail and suppress artifacts. To address this, the authors propose MSFET-E2V, a multiscale frequency-enhanced transformer that fuses spatio-temporal features with frequency-domain information extracted via the discrete wavelet transform. A cross-domain attention mechanism and a lightweight wavelet-enhanced skip block jointly model local details and global structure in the spatial-frequency domain. The method achieves superior reconstruction quality on multiple real-world event datasets and, compared with the existing transformer-based method, significantly reduces parameter count, GPU memory consumption, and inference time.
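To make the input side concrete, the sketch below shows one common way to turn an asynchronous event stream into a dense tensor: binning signed event polarities into a small stack of time slices. This is a generic illustration, not the input representation used by MSFET-E2V, which the summary does not specify. The function name `accumulate_events`, the `(x, y, t, polarity)` column layout, and the hard (non-interpolated) time binning are all assumptions made for the example.

```python
import numpy as np

def accumulate_events(events, height, width, num_bins):
    """Bin events into `num_bins` signed count images.

    `events` is an (N, 4) array with columns (x, y, t, polarity); this layout
    is an assumption for the sketch. Polarity > 0 counts as +1, otherwise -1.
    """
    x = events[:, 0].astype(np.int64)
    y = events[:, 1].astype(np.int64)
    t = events[:, 2].astype(np.float64)
    p = np.where(events[:, 3] > 0, 1.0, -1.0)

    # Normalize timestamps to [0, 1]; guard against a zero-length window.
    span = t.max() - t.min()
    t_norm = (t - t.min()) / span if span > 0 else np.zeros_like(t)

    # Hard assignment of each event to a time bin (the last event is clamped).
    b = np.minimum((t_norm * num_bins).astype(np.int64), num_bins - 1)

    grid = np.zeros((num_bins, height, width), dtype=np.float64)
    # np.add.at accumulates correctly when several events hit the same cell.
    np.add.at(grid, (b, y, x), p)
    return grid

# Four toy events on a 2x3 sensor, split into two time bins.
events = np.array([
    [0, 0, 0.00, 1],
    [1, 0, 0.10, 0],
    [0, 0, 0.20, 1],
    [2, 1, 0.90, 1],
])
grid = accumulate_events(events, height=2, width=3, num_bins=2)
```

Pixels that receive no events stay at zero, which illustrates why such a tensor is sparse and lacks the absolute intensity that a reconstruction network has to recover.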

📝 Abstract

Event cameras offer significant advantages over conventional frame-based cameras, including high temporal resolution, low latency, and energy efficiency. These characteristics make them well suited to capturing high-speed and high-dynamic-range scenes; however, the lack of dense intensity frames limits the direct applicability of conventional computer vision methods for scene understanding. Event-to-video (E2V) reconstruction seeks to bridge this gap by converting asynchronous event streams into a sequence of synchronous video frames. Existing E2V reconstruction methods based on convolutional neural networks and transformers operate primarily in the spatial domain and often struggle to recover fine structural details while suppressing severe reconstruction artifacts. To address these issues, we propose MSFET-E2V, a novel multiscale frequency-enhanced transformer model. At its core lies a cross-domain attention module, which fuses spatio-temporal features with frequency-aware representations derived from the discrete wavelet transform. Unlike prior methods that rely solely on spatial attention, our approach captures both local and global structures by taking into account low- and high-frequency components, enhancing detail preservation and robustness across various motion scenarios. Furthermore, we propose a lightweight wavelet-enhanced skip block that serves as a skip connection, facilitating artifact suppression and structural detail refinement through joint spatial-frequency domain processing. Extensive experiments demonstrate that MSFET-E2V outperforms state-of-the-art methods on multiple real-world event datasets, offering significant gains in reconstruction quality. Moreover, compared to the existing transformer-based method, our proposed model significantly reduces the number of parameters, GPU memory usage, and inference time.
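The low- and high-frequency components mentioned in the abstract can be illustrated with a single-level 2D discrete wavelet transform. The sketch below uses the orthonormal Haar wavelet, which is an assumption: the abstract does not say which wavelet MSFET-E2V uses, and this is not the authors' implementation. The names `haar_dwt2` and `haar_idwt2` are hypothetical helpers written for this example.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D orthonormal Haar DWT of an array with even height and width.

    Returns one low-frequency band and three high-frequency bands, each at
    half the input resolution.
    """
    a = x[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    low = (a + b + c + d) / 2.0        # local average: coarse structure
    row_diff = (a + b - c - d) / 2.0   # change between the two rows of a block
    col_diff = (a - b + c - d) / 2.0   # change between the two columns of a block
    diag = (a - b - c + d) / 2.0       # diagonal detail
    return low, row_diff, col_diff, diag

def haar_idwt2(low, row_diff, col_diff, diag):
    """Inverse of haar_dwt2; reconstructs the original array exactly."""
    h, w = low.shape
    x = np.empty((2 * h, 2 * w), dtype=low.dtype)
    x[0::2, 0::2] = (low + row_diff + col_diff + diag) / 2.0
    x[0::2, 1::2] = (low + row_diff - col_diff - diag) / 2.0
    x[1::2, 0::2] = (low - row_diff + col_diff - diag) / 2.0
    x[1::2, 1::2] = (low - row_diff - col_diff + diag) / 2.0
    return x

# Toy 4x4 "feature map": a smooth ramp.
feat = np.arange(16, dtype=np.float64).reshape(4, 4)
low, row_diff, col_diff, diag = haar_dwt2(feat)
recon = haar_idwt2(low, row_diff, col_diff, diag)
```

Because the transform is invertible and halves the spatial resolution, a network can process the coarse band and the detail bands separately without losing information, which is one plausible reason wavelets pair well with multiscale architectures.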

Problem

Research questions and friction points this paper is trying to address.

Event-to-Video Reconstruction

Event Cameras

Reconstruction Artifacts

Structural Detail Recovery

Spatio-Temporal Features

Innovation

Methods, ideas, or system contributions that make the work stand out.

event-to-video reconstruction

frequency-enhanced transformer

cross-domain attention