🤖 AI Summary
To address the high energy consumption of RGB video surveillance, this paper introduces a new task, Image-and-Event-to-Video (IE2Video): reconstructing high-fidelity, full-frame RGB video from sparse RGB keyframes and continuous asynchronous event streams. Methodologically, the authors integrate an event encoder into a pretrained text-to-video diffusion model (LTX), fine-tuned efficiently with Low-Rank Adaptation (LoRA), in place of a conventional autoregressive architecture (e.g., HyperE2VID). Evaluated across multiple event-camera datasets, the approach reduces LPIPS perceptual distortion by 33% (from 0.422 to 0.283), reconstructs sequences of up to 128 frames (with robust performance across 32-128 frames), and improves visual quality and cross-dataset generalization. The work positions the combination of event-based sensing and diffusion-based video generation as a path toward low-power visual perception.
📝 Abstract
Continuous video monitoring in surveillance, robotics, and wearable systems faces a fundamental power constraint: conventional RGB cameras consume substantial energy through fixed-rate capture. Event cameras offer sparse, motion-driven sensing with low power consumption, but produce asynchronous event streams rather than RGB video. We propose a hybrid capture paradigm that records sparse RGB keyframes alongside continuous event streams, then reconstructs full RGB video offline -- reducing capture power consumption while maintaining standard video output for downstream applications. We introduce the Image and Event to Video (IE2Video) task: reconstructing RGB video sequences from a single initial frame and subsequent event camera data. We investigate two architectural strategies: adapting an autoregressive model (HyperE2VID) for RGB generation, and injecting event representations into a pretrained text-to-video diffusion model (LTX) via learned encoders and low-rank adaptation. Our experiments demonstrate that the diffusion-based approach achieves 33% better perceptual quality than the autoregressive baseline (0.283 vs 0.422 LPIPS). We validate our approach across three event camera datasets (BS-ERGB, HS-ERGB far/close) at varying sequence lengths (32-128 frames), demonstrating robust cross-dataset generalization with strong performance on unseen capture configurations.
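The abstract describes conditioning a video generator on asynchronous event data via a learned encoder and low-rank adaptation. The paper's actual encoder and the LTX backbone are not shown here; as a minimal sketch of the two standard ingredients, the snippet below (NumPy, with hypothetical names `events_to_voxel_grid` and `LoRALinear`) shows how an event stream is commonly binned into a dense voxel-grid tensor that a network can consume, and how a LoRA update adds a trainable low-rank term `(alpha/r) * B @ A` to a frozen weight matrix:

```python
import numpy as np

def events_to_voxel_grid(events, num_bins, height, width):
    """Bin an asynchronous event stream into a dense (num_bins, H, W) tensor.

    events: (N, 4) array of (timestamp, x, y, polarity in {-1, +1}).
    Polarities are accumulated into temporal bins; this is one common
    event representation, not necessarily the one used in the paper.
    """
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    t = events[:, 0]
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9)  # normalize to [0, 1]
    bins = np.clip((t_norm * num_bins).astype(int), 0, num_bins - 1)
    x = events[:, 1].astype(int)
    y = events[:, 2].astype(int)
    np.add.at(grid, (bins, y, x), events[:, 3])  # accumulate signed polarity
    return grid

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update: W + (alpha/r) * B @ A."""
    def __init__(self, W, rank=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                        # frozen pretrained weight
        self.A = rng.normal(0, 0.02, (rank, W.shape[1]))  # small random init
        self.B = np.zeros((W.shape[0], rank))             # zero init: no-op at start
        self.scale = alpha / rank

    def __call__(self, x):
        # Only A and B would be trained; W stays frozen during fine-tuning.
        return x @ (self.W + self.scale * self.B @ self.A).T
```

Because `B` starts at zero, the LoRA layer initially reproduces the frozen model exactly, which is what makes it safe to bolt onto a pretrained diffusion backbone before fine-tuning on event data.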