🤖 AI Summary
This work addresses the challenge of training transferable vision foundation models on event camera data, a setting hampered by scarce annotations and a sensing modality that differs fundamentally from standard images. To this end, the authors propose GEP (Generative Event Pretraining), a two-stage framework: first, an event encoder is aligned with a frozen general-purpose vision foundation model via a joint regression-contrastive loss; second, a Transformer backbone is pretrained autoregressively on hybrid event–image sequences to capture the distinctive temporal dynamics of event data. By coupling semantic alignment with generative sequence modeling, GEP learns semantically rich, temporally sensitive universal event representations. The resulting model outperforms existing event-based pretraining approaches across diverse downstream tasks, including object recognition, segmentation, and depth estimation, and shows strong cross-domain generalization.
📝 Abstract
Event cameras provide robust visual signals under fast motion and challenging illumination thanks to their microsecond latency and high dynamic range. However, their unique sensing characteristics and limited labeled data make it difficult to train event-based visual foundation models (VFMs), which are crucial for learning visual features transferable across tasks. To tackle this problem, we propose GEP (Generative Event Pretraining), a two-stage framework that transfers semantic knowledge learned from internet-scale image datasets to event data while learning event-specific temporal dynamics. First, an event encoder is aligned with a frozen VFM through a joint regression-contrastive objective, grounding event features in image semantics. Second, a Transformer backbone is pretrained autoregressively on mixed event–image sequences to capture the temporal structure unique to events. Our approach outperforms state-of-the-art event pretraining methods on a diverse range of downstream tasks, including object recognition, segmentation, and depth estimation. Together, VFM-guided alignment and generative sequence modeling yield a semantically rich, temporally aware event model that generalizes robustly across domains.
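The stage-1 alignment objective described above can be sketched in PyTorch. This is an illustrative reading, not the authors' implementation: the symmetric InfoNCE form, the temperature, and the regression weight are all assumptions.

```python
# Hypothetical sketch of GEP's stage-1 joint regression-contrastive
# objective. Hyperparameters (temperature, reg_weight, embedding dim)
# are illustrative, not the paper's actual values.
import torch
import torch.nn.functional as F

def alignment_loss(event_emb, vfm_emb, temperature=0.07, reg_weight=1.0):
    """Align event-encoder features to frozen-VFM image features.

    event_emb, vfm_emb: (batch, dim) tensors of paired embeddings;
    vfm_emb comes from the frozen teacher and receives no gradient.
    """
    vfm_emb = vfm_emb.detach()  # keep the VFM teacher frozen

    # Regression term: pull each event embedding toward its paired image embedding.
    reg = F.mse_loss(event_emb, vfm_emb)

    # Contrastive (InfoNCE) term: the paired event/image embeddings are
    # positives; every other pair in the batch is a negative.
    e = F.normalize(event_emb, dim=-1)
    v = F.normalize(vfm_emb, dim=-1)
    logits = e @ v.t() / temperature       # (batch, batch) cosine similarities
    targets = torch.arange(e.size(0))      # positives lie on the diagonal
    contrastive = (F.cross_entropy(logits, targets)
                   + F.cross_entropy(logits.t(), targets)) / 2

    return reg_weight * reg + contrastive

# Toy usage with random stand-ins for encoder and VFM outputs.
events = torch.randn(8, 256, requires_grad=True)
images = torch.randn(8, 256)
loss = alignment_loss(events, images)
```

The `detach()` call is the key design point: gradients flow only into the event encoder, so the image semantics of the VFM act as a fixed target rather than drifting during training.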
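Stage 2, autoregressive pretraining over mixed event–image sequences, can likewise be sketched as next-token prediction with a causally masked Transformer. Everything here is an assumption for illustration: the discrete tokenization, vocabulary size, and model width are stand-ins, since the abstract does not specify how sequences are encoded.

```python
# Hypothetical stage-2 sketch: autoregressive next-token prediction over
# a mixed event/image token sequence with a tiny causal Transformer.
# Tokenization, vocab_size, and model size are illustrative assumptions.
import torch
import torch.nn as nn

class TinyARBackbone(nn.Module):
    def __init__(self, vocab_size=512, dim=64, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(
            dim, n_heads, dim_feedforward=4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):
        # Causal mask: each position attends only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.blocks(self.embed(tokens), mask=mask)
        return self.head(h)

# Toy pretraining step: predict token t+1 from tokens up to t.
model = TinyARBackbone()
seq = torch.randint(0, 512, (2, 16))   # stand-in hybrid event/image sequence
logits = model(seq[:, :-1])            # shape (2, 15, vocab_size)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), seq[:, 1:].reshape(-1))
```

The causal mask is what makes this generative rather than masked modeling: the backbone must predict how the sequence unfolds in time, which is presumably where the event-specific temporal structure is learned.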