TESPEC: Temporally-Enhanced Self-Supervised Pretraining for Event Cameras

📅 2025-07-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing self-supervised pretraining methods for event cameras largely adopt RGB-image paradigms: they process only short event sequences and thus neglect long-term temporal modeling, limiting event-aware representation learning. To address this, we propose TESPEC, the first long-sequence self-supervised pretraining framework designed specifically for event data. First, TESPEC extends Masked Image Modeling (MIM) to long event sequences and explicitly models cross-frame spatio-temporal dependencies with recurrent neural networks. Second, it introduces semantically rich pseudo-grayscale videos as reconstruction targets, generated by temporally aggregating events to encode long-horizon context. Third, it incorporates noise-robust learning and motion-deblurring mechanisms to improve event reconstruction fidelity. Extensive experiments demonstrate that TESPEC achieves state-of-the-art performance across diverse downstream tasks, including object detection, semantic segmentation, and monocular depth estimation, validating the critical role of long-term temporal modeling in event-based representation learning.
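The masking side of the MIM paradigm can be pictured with a short sketch: random square patches of an event frame are hidden, and the model must reconstruct them from the visible context. This is a generic MIM-style masking routine, not TESPEC's exact scheme; `patch` and `mask_ratio` are illustrative parameters.

```python
import numpy as np

def mask_patches(frame, patch=4, mask_ratio=0.75, rng=None):
    """Randomly hide square patches of a frame, MIM-style.

    Returns the masked frame and a boolean patch-grid mask
    (True = patch hidden, to be reconstructed by the model).
    Illustrative only; not TESPEC's exact masking scheme.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    H, W = frame.shape
    gh, gw = H // patch, W // patch              # patch-grid size
    n_mask = int(gh * gw * mask_ratio)           # how many patches to hide
    chosen = rng.permutation(gh * gw)[:n_mask]   # random patch indices
    mask = np.zeros((gh, gw), dtype=bool)
    mask[np.unravel_index(chosen, (gh, gw))] = True
    masked = frame.copy()
    # expand the patch-grid mask to pixel resolution and zero it out
    masked[np.kron(mask, np.ones((patch, patch), dtype=bool))] = 0.0
    return masked, mask
```

During pretraining, the reconstruction loss would be computed only on the hidden patches, so the model is forced to infer missing content from the surrounding spatio-temporal context.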

📝 Abstract
Long-term temporal information is crucial for event-based perception tasks, as raw events only encode pixel brightness changes. Recent works show that when trained from scratch, recurrent models achieve better results than feedforward models in these tasks. However, when leveraging self-supervised pre-trained weights, feedforward models can outperform their recurrent counterparts. Current self-supervised learning (SSL) methods for event-based pre-training largely mimic RGB image-based approaches. They pre-train feedforward models on raw events within a short time interval, ignoring the temporal information of events. In this work, we introduce TESPEC, a self-supervised pre-training framework tailored for learning spatio-temporal information. TESPEC is well-suited for recurrent models, as it is the first framework to leverage long event sequences during pre-training. TESPEC employs the masked image modeling paradigm with a new reconstruction target. We design a novel method to accumulate events into pseudo grayscale videos containing high-level semantic information about the underlying scene, which is robust to sensor noise and reduces motion blur. Reconstructing this target thus requires the model to reason about long-term history of events. Extensive experiments demonstrate our state-of-the-art results in downstream tasks, including object detection, semantic segmentation, and monocular depth estimation. Project webpage: https://mhdmohammadi.github.io/TESPEC_webpage.
Problem

Research questions and friction points this paper is trying to address.

Enhancing long-term temporal learning for event cameras
Improving self-supervised pre-training for recurrent models
Reducing noise and motion blur in event-based perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised pre-training with long event sequences
Masked image modeling with novel reconstruction target
Accumulating events into pseudo grayscale videos
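The accumulation idea above can be pictured as a leaky integration of event polarities: brightness changes are summed per pixel while the running state decays between frames, so stale events fade out (which damps sensor noise and motion blur). This is a minimal illustrative sketch, not the paper's exact formulation; the `decay` parameter is a hypothetical choice.

```python
import numpy as np

def events_to_pseudo_grayscale(event_batches, shape, decay=0.9):
    """Accumulate batches of events into pseudo-grayscale frames.

    Each batch is an (N, 3) array of (x, y, polarity) rows with
    polarity in {-1, +1}. The per-pixel state decays between frames
    (leaky integrator), so old events fade out over time.
    Illustrative sketch; `decay` is a hypothetical parameter.
    """
    state = np.zeros(shape, dtype=np.float32)
    frames = []
    for batch in event_batches:
        state *= decay                      # fade contributions of old events
        for x, y, p in batch:
            state[y, x] += p                # integrate brightness changes
        lo, hi = state.min(), state.max()
        frames.append((state - lo) / (hi - lo + 1e-8))  # normalize to [0, 1]
    return frames
```

Reconstructing such frames requires reasoning over the full event history, since any single short window of events carries only incremental brightness changes rather than scene appearance.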