Generative Event Pretraining with Foundation Model Alignment

📅 2026-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of training transferable vision foundation models from event camera data, which suffers from scarce annotations and a sensing modality unlike conventional frames. To this end, the authors propose GEP, a two-stage framework: first, an event encoder is aligned with a frozen general-purpose vision foundation model via a joint regression-contrastive loss; second, a Transformer backbone is pretrained autoregressively on hybrid event–image sequences to capture the distinctive temporal dynamics of events. By coupling semantic alignment with generative sequence modeling, GEP learns semantically rich, temporally aware event representations. The resulting model outperforms existing event-based pretraining approaches across diverse downstream tasks—including object recognition, segmentation, and depth estimation—and generalizes well across domains.
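The first stage pairs a regression term with a contrastive term to pull event embeddings toward the frozen foundation model's embeddings. A minimal numpy sketch of such a joint objective is below; the function names, the symmetric InfoNCE form, and the weighting `lambda_reg` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def _l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def _logsumexp(x, axis):
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def joint_alignment_loss(event_feats, vfm_feats, temperature=0.07, lambda_reg=1.0):
    """Regression-contrastive alignment of event features to frozen VFM features.

    event_feats, vfm_feats: (N, D) arrays of paired embeddings.
    Regression term: MSE between paired embeddings (feature distillation).
    Contrastive term: symmetric InfoNCE with in-batch negatives.
    Hypothetical sketch; the authors' loss weights and form may differ.
    """
    # Regression: pull each event embedding toward its paired VFM embedding.
    reg = np.mean((event_feats - vfm_feats) ** 2)

    # Contrastive: similarity matrix whose diagonal holds the positive pairs.
    e = _l2_normalize(event_feats)
    v = _l2_normalize(vfm_feats)
    logits = e @ v.T / temperature                      # (N, N)
    idx = np.arange(len(logits))
    lp_e2v = logits - _logsumexp(logits, axis=1)        # event -> image direction
    lp_v2e = logits.T - _logsumexp(logits.T, axis=1)    # image -> event direction
    contrastive = -0.5 * (np.mean(lp_e2v[idx, idx]) + np.mean(lp_v2e[idx, idx]))

    return contrastive + lambda_reg * reg
```

The regression term grounds absolute feature values, while the contrastive term enforces instance-level discrimination against in-batch negatives; combining both is a common recipe for distilling a frozen teacher into a new modality.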

📝 Abstract
Event cameras provide robust visual signals under fast motion and challenging illumination conditions thanks to their microsecond latency and high dynamic range. However, their unique sensing characteristics and limited labeled data make it challenging to train event-based visual foundation models (VFMs), which are crucial for learning visual features transferable across tasks. To tackle this problem, we propose GEP (Generative Event Pretraining), a two-stage framework that transfers semantic knowledge learned from internet-scale image datasets to event data while learning event-specific temporal dynamics. First, an event encoder is aligned to a frozen VFM through a joint regression-contrastive objective, grounding event features in image semantics. Second, a transformer backbone is autoregressively pretrained on mixed event-image sequences to capture the temporal structure unique to events. Our approach outperforms state-of-the-art event pretraining methods on a diverse range of downstream tasks, including object recognition, segmentation, and depth estimation. Together, VFM-guided alignment and generative sequence modeling yield a semantically rich, temporally aware event model that generalizes robustly across domains.
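The second stage pretrains the transformer backbone with a standard autoregressive objective over interleaved event-image token sequences. A rough numpy sketch of that next-token loss follows; the shared token vocabulary and sequence layout are hypothetical stand-ins for the paper's actual tokenization.

```python
import numpy as np

def next_token_nll(logits, tokens):
    """Autoregressive pretraining objective: predict token t+1 from step t.

    logits: (T, V) model outputs over a mixed event-image token sequence.
    tokens: (T,) integer token ids. In this sketch, event tokens (e.g. from
            discretized voxel grids) and image-patch tokens share one
            vocabulary; the paper's tokenization may differ.
    """
    pred = logits[:-1]                  # predictions for positions 1..T-1
    target = tokens[1:]                 # targets shifted by one step
    m = pred.max(axis=1, keepdims=True)
    log_probs = pred - (m + np.log(np.exp(pred - m).sum(axis=1, keepdims=True)))
    return -np.mean(log_probs[np.arange(len(target)), target])
```

In training, one sequence would concatenate image tokens with the event tokens that follow them in time, so minimizing this loss forces the backbone to model how event streams evolve from visual context.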
Problem

Research questions and friction points this paper is trying to address.

event camera
visual foundation model
limited labeled data
cross-task transfer
event-based vision
Innovation

Methods, ideas, or system contributions that make the work stand out.

event camera
foundation model alignment
generative pretraining
temporal dynamics
visual foundation model