Adaptive Event Stream Slicing for Open-Vocabulary Event-Based Object Detection via Vision-Language Knowledge Distillation

πŸ“… 2025-10-01
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Event cameras lack texture and color cues, hindering open-vocabulary object detection. Existing methods are constrained by predefined categories, while direct adaptation of vision-language models (e.g., CLIP) suffers from severe modality gaps between event streams and RGB images. To address this, we propose an open-vocabulary event detection framework comprising three key components: (1) an adaptive spike-driven event stream slicing mechanism, where a spiking neural network (SNN) dynamically partitions temporal sequences; (2) a spatial-attention-guided knowledge distillation framework that effectively transfers CLIP’s image-domain semantic priors to an event-based student network; and (3) a hybrid SNN-CNN architecture jointly optimizing temporal modeling and feature representation. Evaluated on public event datasets, our method significantly improves detection accuracy for both known and novel categories. It marks the first successful and efficient transfer of CLIP knowledge to the event modality, outperforming fixed-grouping and other baseline strategies.

πŸ“ Abstract
Event cameras offer advantages in object detection tasks due to high-speed response, low latency, and robustness to motion blur. However, event cameras lack texture and color information, making open-vocabulary detection particularly challenging. Current event-based detection methods are typically trained on predefined categories, limiting their ability to generalize to novel objects in real-world scenarios, where previously unseen objects are common. Vision-language models (VLMs) have enabled open-vocabulary object detection in RGB images. However, the modality gap between images and event streams makes directly transferring CLIP to event data ineffective, as CLIP was not designed for event streams. To bridge this gap, we propose an event-image knowledge distillation framework that leverages CLIP's semantic understanding to achieve open-vocabulary object detection on event data. Instead of training CLIP directly on event streams, we use image frames as inputs to a teacher model, guiding the event-based student model to learn CLIP's rich visual representations. Through spatial attention-based distillation, the student network learns meaningful visual features directly from raw event inputs while inheriting CLIP's broad visual knowledge. Furthermore, to prevent information loss due to event data segmentation, we design a hybrid spiking neural network (SNN) and convolutional neural network (CNN) framework. Unlike fixed-group event segmentation methods, which often discard crucial temporal information, our SNN adaptively determines the optimal event segmentation moments, ensuring that key temporal features are extracted. The extracted event features are then processed by CNNs for object detection.
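The spatial attention-based distillation described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not the paper's implementation: the attention map is taken as the channel-wise sum of squared activations (a common attention-transfer formulation), the objective is a plain MSE between normalized teacher and student maps, and the names `spatial_attention` and `attention_distillation_loss` are made up.

```python
import numpy as np

def spatial_attention(features: np.ndarray) -> np.ndarray:
    """Collapse a (C, H, W) feature map into an L2-normalized (H, W)
    attention map by summing squared activations over channels.
    (Illustrative definition; the paper's exact formulation may differ.)"""
    attn = (features ** 2).sum(axis=0)          # (H, W)
    return attn / (np.linalg.norm(attn) + 1e-8)

def attention_distillation_loss(student_feats: np.ndarray,
                                teacher_feats: np.ndarray) -> float:
    """MSE between the event-branch student's and the CLIP image-branch
    teacher's spatial attention maps."""
    a_s = spatial_attention(student_feats)
    a_t = spatial_attention(teacher_feats)
    return float(np.mean((a_s - a_t) ** 2))

rng = np.random.default_rng(0)
t = rng.standard_normal((64, 8, 8))   # hypothetical teacher features
s = rng.standard_normal((64, 8, 8))   # hypothetical student features
print(attention_distillation_loss(t, t))  # identical maps -> 0.0
print(attention_distillation_loss(s, t))  # mismatched maps -> positive loss
```

Because both maps are normalized before comparison, the loss is driven by *where* each network attends rather than by the absolute magnitude of its activations.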
Problem

Research questions and friction points this paper is trying to address.

Achieving open-vocabulary object detection with event cameras lacking texture information
Bridging modality gap between event streams and vision-language models like CLIP
Preventing temporal information loss in adaptive event stream segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-language knowledge distillation bridges modality gap
Hybrid SNN-CNN framework adaptively segments event streams
Spatial attention distillation transfers CLIP knowledge to events
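The adaptive slicing idea above can be sketched with a single leaky integrate-and-fire (LIF) neuron: it integrates the per-bin event count and emits a spike (a slice boundary) when its membrane potential crosses a threshold, so dense bursts close a slice sooner than quiet stretches. The function name, decay, and threshold values are illustrative assumptions, not the paper's architecture.

```python
def adaptive_slice(event_counts, decay=0.8, threshold=50.0):
    """Return slice-boundary indices for a 1-D sequence of per-bin event
    counts, using a single LIF neuron (illustrative sketch only)."""
    boundaries = []
    v = 0.0                      # membrane potential
    for i, c in enumerate(event_counts):
        v = decay * v + c        # leaky integration of incoming events
        if v >= threshold:       # spike: close the current slice here
            boundaries.append(i)
            v = 0.0              # hard reset after firing
    return boundaries

# A burst at bins 3-4 and another at bin 9 trigger boundaries; the quiet
# stretch in between leaks away without firing.
counts = [5, 5, 5, 40, 45, 2, 2, 2, 2, 60]
print(adaptive_slice(counts))  # -> [4, 9]
```

A fixed-grouping baseline would cut every N bins regardless of activity; the point of the adaptive scheme is that boundary placement follows the event density itself.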
Jinchang Zhang
Intelligent Vision and Sensing (IVS) Lab at SUNY Binghamton, USA
Zijun Li
Intelligent Vision and Sensing (IVS) Lab at SUNY Binghamton, USA
Jiakai Lin
University of Georgia
Computer Vision
Guoyu Lu
SUNY Binghamton
Robotics, Computer Vision, Machine Learning