Scaling Dense Event-Stream Pretraining from Visual Foundation Models

📅 2026-03-04
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenges of semantic scarcity and limited scalability in event stream data, which stem from the high cost of annotation and hinder the learning of fine-grained, general-purpose representations. To overcome this, we propose a self-supervised pretraining approach that, for the first time, leverages the semantic structure of visual foundation models (VFMs) as a strong supervisory signal. Our method employs cross-modal distillation on large-scale synchronized image–event datasets to learn rich event representations, and introduces a structure-aware alignment loss to mitigate semantic collapse caused by discrepancies in sparsity and granularity between images and events. Extensive experiments demonstrate that our approach significantly outperforms existing methods across multiple downstream tasks, exhibiting superior generalization, higher data efficiency, and enhanced transferability.
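
The summary does not specify the exact form of the structure-aware alignment loss, so the following is a minimal PyTorch sketch of one plausible instantiation: event patch features are matched to frozen VFM patch features point-wise, and the teacher's patch-to-patch similarity structure is matched as an additional term. The function name, the Frobenius-style structure matching, and the weight `lambda_struct` are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of a structure-aware distillation loss for event pretraining.
# Assumptions: a trainable event encoder producing (B, N, D) patch tokens and a
# frozen VFM teacher producing spatially aligned (B, N, D) tokens.
import torch
import torch.nn.functional as F


def structure_aware_distill_loss(event_feats, vfm_feats, lambda_struct=1.0):
    """Align dense event features to frozen VFM features.

    event_feats: (B, N, D) patch features from the event encoder (student).
    vfm_feats:   (B, N, D) patch features from the frozen VFM (teacher).
    """
    e = F.normalize(event_feats, dim=-1)
    v = F.normalize(vfm_feats, dim=-1)

    # Point-wise alignment: each event patch mimics its corresponding image patch.
    point_loss = (1.0 - (e * v).sum(dim=-1)).mean()

    # Structure-aware term: match the teacher's patch-to-patch similarity graph,
    # extending supervision beyond individual patch correspondences.
    sim_e = e @ e.transpose(1, 2)   # (B, N, N) event similarity structure
    sim_v = v @ v.transpose(1, 2)   # (B, N, N) VFM semantic structure
    struct_loss = F.mse_loss(sim_e, sim_v)

    return point_loss + lambda_struct * struct_loss
```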

πŸ“ Abstract
Learning versatile, fine-grained representations from irregular event streams is pivotal yet nontrivial, primarily due to the heavy annotation cost that hinders scalability in dataset size, semantic richness, and application scope. To mitigate this dilemma, we introduce a novel self-supervised pretraining method that distills visual foundation models (VFMs) to push the boundaries of event representation at scale. Specifically, we curate an extensive synchronized image-event collection to amplify cross-modal alignment. Nevertheless, due to inherent mismatches in sparsity and granularity between the image and event domains, existing distillation paradigms are prone to semantic collapse in event representations, particularly at high resolutions. To bridge this gap, we propose to extend the alignment objective to semantic structures provided off-the-shelf by VFMs, which offers a broader receptive field and stronger supervision. The key ingredient of our method is a structure-aware distillation loss that grounds higher-quality image-event correspondences for alignment, optimizing dense event representations. Extensive experiments demonstrate that our approach delivers substantial gains on downstream benchmarks, significantly surpassing traditional methods and existing pretraining techniques. These gains manifest in enhanced generalization, superior data efficiency, and improved transferability.
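
To make the overall recipe concrete, here is a hedged sketch of the cross-modal pretraining loop on synchronized image-event pairs, reusing the loss sketched above. The encoder and teacher objects, the dataloader yielding `(events, images)` pairs, and the optimizer settings are placeholders for illustration, not the authors' released code.

```python
# Minimal pretraining loop: a frozen VFM teacher supervises a trainable event
# encoder on temporally synchronized image-event pairs.
import torch


def pretrain(event_encoder, frozen_vfm, loader, epochs=10, lr=1e-4):
    opt = torch.optim.AdamW(event_encoder.parameters(), lr=lr)
    frozen_vfm.eval()                        # teacher stays fixed
    for _ in range(epochs):
        for events, images in loader:        # synchronized image-event pairs
            with torch.no_grad():
                vfm_feats = frozen_vfm(images)     # (B, N, D) teacher tokens
            event_feats = event_encoder(events)    # (B, N, D) student tokens
            loss = structure_aware_distill_loss(event_feats, vfm_feats)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return event_encoder
```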
Problem

Research questions and friction points this paper is trying to address.

event-stream
self-supervised pretraining
visual foundation models
representation learning
cross-modal alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-supervised pretraining
visual foundation models
event-stream representation
structure-aware distillation
cross-modal alignment