🤖 AI Summary
Neuromorphic vision systems (e.g., event cameras) produce binary, spatially sparse event streams with rich temporal resolution but severely limited spatial information. To address this modality gap, we propose the first generative cross-modal fusion framework specifically designed for event streams. Our approach innovatively introduces latent-space manipulation from generative modeling into event processing, constructing a hybrid architecture that integrates diffusion models and conditional variational autoencoders. It jointly models events and RGB frames in a shared latent space via event encoding, spatiotemporal alignment, and multimodal conditional generation. Evaluated on event-driven deblurring, sparse-to-dense optical flow reconstruction, and high-speed novel-view synthesis, our method achieves state-of-the-art performance—improving PSNR by 3.2–5.8 dB over prior works—and demonstrates superior temporal fidelity compared to interpolation- and CNN-based approaches.
📝 Abstract
Neuromorphic Visual Systems, such as spike cameras, have attracted considerable attention due to their ability to capture clear textures under dynamic conditions. This capability effectively mitigates issues related to motion and aperture blur. However, in contrast to conventional RGB modalities that provide dense spatial information, these systems generate binary, spatially sparse frames as a trade-off for temporally rich visual streams. In this context, generative models emerge as a promising solution to address the inherent limitations of sparse data. These models not only facilitate the conditional fusion of existing information from both spike and RGB modalities but also enable the conditional generation based on latent priors. In this study, we introduce a robust generative processing framework named SpikeGen, designed for visual spike streams captured by spike cameras. We evaluate this framework across multiple tasks involving mixed spike-RGB modalities, including conditional image/video deblurring, dense frame reconstruction from spike streams, and high-speed scene novel-view synthesis. Supported by comprehensive experimental results, we demonstrate that leveraging the latent space operation abilities of generative models allows us to effectively address the sparsity of spatial information while fully exploiting the temporal richness of spike streams, thereby promoting a synergistic enhancement of different visual modalities.