DESSERT: Diffusion-based Event-driven Single-frame Synthesis via Residual Training

📅 2025-12-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Video frame prediction in dynamic scenes often suffers from blur and artifacts because the content of the next frame is unknown. This paper addresses the problem for event camera data with DESSERT, a diffusion-based framework trained on inter-frame residuals for sharp, temporally consistent single-frame synthesis. The key contributions are: (1) an Event-to-Residual Alignment VAE (ER-VAE) that aligns the event frame between the anchor and target frames with the corresponding residual, avoiding explicit optical flow estimation and pixel warping; (2) Diverse-Length Temporal (DLT) augmentation, which improves robustness by training on frame segments of varying temporal lengths; and (3) a diffusion model, built on pre-trained Stable Diffusion, that denoises the residual latent conditioned on event data. The reported experiments show the method outperforming existing event-based reconstruction, image-based and event-based video frame prediction, and one-sided event-based video frame interpolation methods, with sharper and more temporally consistent frames.

📝 Abstract

Video frame prediction extrapolates future frames from previous frames, but suffers from prediction errors in dynamic scenes due to the lack of information about the next frame. Event cameras address this limitation by capturing per-pixel brightness changes asynchronously with high temporal resolution. Prior research on event-based video frame prediction has leveraged motion information from event data, often by predicting event-based optical flow and reconstructing frames via pixel warping. However, such approaches introduce holes and blurring when pixel displacement is inaccurate. To overcome this limitation, we propose DESSERT, a diffusion-based event-driven single-frame synthesis framework via residual training. Leveraging a pre-trained Stable Diffusion model, our method is trained on inter-frame residuals to ensure temporal consistency. The training pipeline consists of two stages: (1) an Event-to-Residual Alignment Variational Autoencoder (ER-VAE) that aligns the event frame between anchor and target frames with the corresponding residual, and (2) a diffusion model that denoises the residual latent conditioned on event data. Furthermore, we introduce Diverse-Length Temporal (DLT) augmentation, which improves robustness by training on frame segments of varying temporal lengths. Experimental results demonstrate that our method outperforms existing event-based reconstruction, image-based video frame prediction, event-based video frame prediction, and one-sided event-based video frame interpolation methods, producing sharper and more temporally consistent frame synthesis.
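The two quantities the abstract pairs up, the inter-frame residual and the event frame between anchor and target, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `(t, x, y, polarity)` event layout, and the signed-count accumulation are assumptions chosen for illustration, and plain nested lists stand in for image tensors.

```python
def inter_frame_residual(anchor, target):
    """Residual = target - anchor, computed per pixel (assumed definition)."""
    return [[t - a for a, t in zip(row_a, row_t)]
            for row_a, row_t in zip(anchor, target)]


def apply_residual(anchor, residual, lo=0.0, hi=1.0):
    """Recover a frame by adding a residual to the anchor, clamped to [lo, hi]."""
    return [[min(hi, max(lo, a + r)) for a, r in zip(row_a, row_r)]
            for row_a, row_r in zip(anchor, residual)]


def accumulate_events(events, height, width, t_start, t_end):
    """Sum signed polarities of events with t_start <= t < t_end into a 2D grid.

    Each event is a hypothetical (t, x, y, polarity) tuple.
    """
    frame = [[0] * width for _ in range(height)]
    for t, x, y, polarity in events:
        if t_start <= t < t_end:
            frame[y][x] += 1 if polarity > 0 else -1
    return frame


# Toy 2x2 grayscale frames with intensities in [0, 1].
anchor = [[0.25, 0.5], [0.5, 0.5]]
target = [[0.75, 0.5], [0.5, 0.125]]

residual = inter_frame_residual(anchor, target)
reconstructed = apply_residual(anchor, residual)

# Toy events; the last one falls outside the anchor-to-target window.
events = [(0.1, 0, 0, 1), (0.2, 0, 0, 1), (0.3, 1, 1, -1), (1.5, 1, 0, 1)]
event_frame = accumulate_events(events, 2, 2, t_start=0.0, t_end=1.0)
```

In this toy case the event frame is nonzero at exactly the pixels where the residual is nonzero, with matching signs. That correspondence is the intuition behind aligning the two representations; in the described method the alignment is learned by the ER-VAE rather than computed directly.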

Problem

Research questions and friction points this paper is trying to address.

Overcomes prediction errors in dynamic scenes for video frame extrapolation.

Addresses holes and blurring from inaccurate pixel displacement in event-based methods.

Ensures temporal consistency in event-driven single-frame synthesis via residual training.
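The hole problem named above can be reproduced with a toy forward warp. The sketch below is a generic one-dimensional illustration, not any specific prior method: `forward_warp_row` and the integer displacement list are hypothetical, and collisions are resolved by letting the last write win.

```python
def forward_warp_row(row, flow):
    """Move each pixel by its integer displacement; unfilled targets stay None."""
    out = [None] * len(row)
    for x, (value, dx) in enumerate(zip(row, flow)):
        tx = x + dx
        if 0 <= tx < len(row):
            out[tx] = value  # on collision, the last source pixel written wins
    return out


row = [10, 20, 30, 40]
flow = [0, 1, 1, 0]  # pixels 1 and 2 are displaced one step to the right

warped = forward_warp_row(row, flow)
holes = [x for x, value in enumerate(warped) if value is None]
```

Here position 1 receives no source pixel and becomes a hole, while position 3 receives two and loses one of them. Synthesizing the residual directly, as this paper proposes, avoids moving pixels at all.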

Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion-based event-driven single-frame synthesis via residual training

Event-to-Residual Alignment Variational Autoencoder for aligning event frames

Diverse-Length Temporal augmentation for improved robustness in training
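One plausible reading of Diverse-Length Temporal augmentation is to vary the anchor-to-target gap when drawing training pairs. The sketch below assumes that reading; `sample_dlt_pair`, `max_gap`, and the uniform sampling are illustrative choices, not details given in the abstract.

```python
import random


def sample_dlt_pair(num_frames, max_gap, rng):
    """Pick (anchor_index, target_index) with a random gap in [1, max_gap]."""
    if num_frames < 2 or max_gap < 1:
        raise ValueError("need at least two frames and max_gap >= 1")
    gap = rng.randint(1, min(max_gap, num_frames - 1))
    anchor = rng.randint(0, num_frames - 1 - gap)
    return anchor, anchor + gap


rng = random.Random(0)  # seeded so the sampled pairs are reproducible
pairs = [sample_dlt_pair(num_frames=10, max_gap=4, rng=rng) for _ in range(200)]
gaps = sorted({target - anchor for anchor, target in pairs})
```

Each sampled pair defines one training segment: the events between the two indices would form the conditioning input, and the residual between the two frames would be the target. Longer gaps correspond to larger motion, which is the variation the augmentation is meant to cover.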

🔎 Similar Papers

A Survey on Future Frame Synthesis: Bridging Deterministic and Generative Approaches

2024-01-26 | Citations: 1
