🤖 AI Summary
To address the high energy consumption of RGB video surveillance, this paper introduces a new task, Image-and-Event-to-Video (IE2Video): reconstructing high-fidelity, full-frame RGB video from sparse RGB keyframes and continuous asynchronous event streams. Methodologically, the authors integrate an event encoder into a pretrained text-to-video diffusion model (LTX), fine-tuned efficiently with Low-Rank Adaptation (LoRA), in place of a conventional autoregressive architecture (e.g., HyperE2VID). Evaluated across multiple event-camera datasets, the approach reduces LPIPS perceptual distortion by 33% (from 0.422 to 0.283), reconstructs sequences of up to 128 frames (with robust performance across 32-128 frames), and improves visual quality and cross-dataset generalization. The work positions the combination of event-based sensing and diffusion-based video generation as a path toward low-power visual perception.
📝 Abstract
Continuous video monitoring in surveillance, robotics, and wearable systems faces a fundamental power constraint: conventional RGB cameras consume substantial energy through fixed-rate capture. Event cameras offer sparse, motion-driven sensing with low power consumption, but produce asynchronous event streams rather than RGB video. We propose a hybrid capture paradigm that records sparse RGB keyframes alongside continuous event streams, then reconstructs full RGB video offline -- reducing capture power consumption while maintaining standard video output for downstream applications. We introduce the Image and Event to Video (IE2Video) task: reconstructing RGB video sequences from a single initial frame and subsequent event camera data. We investigate two architectural strategies: adapting an autoregressive model (HyperE2VID) for RGB generation, and injecting event representations into a pretrained text-to-video diffusion model (LTX) via learned encoders and low-rank adaptation. Our experiments demonstrate that the diffusion-based approach achieves 33% better perceptual quality than the autoregressive baseline (0.283 vs 0.422 LPIPS). We validate our approach across three event camera datasets (BS-ERGB, HS-ERGB far/close) at varying sequence lengths (32-128 frames), demonstrating robust cross-dataset generalization with strong performance on unseen capture configurations.
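The abstract describes conditioning a video generator on asynchronous event data via a learned encoder and low-rank adaptation. The paper's actual encoder and the LTX backbone are not shown here; as a minimal sketch of the two standard ingredients, the snippet below (NumPy, with hypothetical names `events_to_voxel_grid` and `LoRALinear`) shows how an event stream is commonly binned into a dense voxel-grid tensor that a network can consume, and how a LoRA update adds a trainable low-rank term `(alpha/r) * B @ A` to a frozen weight matrix:

```python
import numpy as np

def events_to_voxel_grid(events, num_bins, height, width):
    """Bin an asynchronous event stream into a dense (num_bins, H, W) tensor.

    events: (N, 4) array of (timestamp, x, y, polarity in {-1, +1}).
    Polarities are accumulated into temporal bins; this is one common
    event representation, not necessarily the one used in the paper.
    """
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    t = events[:, 0]
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9)  # normalize to [0, 1]
    bins = np.clip((t_norm * num_bins).astype(int), 0, num_bins - 1)
    x = events[:, 1].astype(int)
    y = events[:, 2].astype(int)
    np.add.at(grid, (bins, y, x), events[:, 3])  # accumulate signed polarity
    return grid

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update: W + (alpha/r) * B @ A."""
    def __init__(self, W, rank=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                        # frozen pretrained weight
        self.A = rng.normal(0, 0.02, (rank, W.shape[1]))  # small random init
        self.B = np.zeros((W.shape[0], rank))             # zero init: no-op at start
        self.scale = alpha / rank

    def __call__(self, x):
        # Only A and B would be trained; W stays frozen during fine-tuning.
        return x @ (self.W + self.scale * self.B @ self.A).T
```

Because `B` starts at zero, the LoRA layer initially reproduces the frozen model exactly, which is what makes it safe to bolt onto a pretrained diffusion backbone before fine-tuning on event data.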