🤖 AI Summary
Existing event-based semantic segmentation methods struggle to integrate the spatial richness of frame images with the high temporal resolution of event streams, which leads to overly complex models and substantial computational overhead. This paper proposes an efficient hybrid neural network framework that processes event streams with a spiking neural network (SNN) branch and frame images with an artificial neural network (ANN) branch. To enable fine-grained cross-modal interaction, it introduces three modules: the Adaptive Temporal Weighting (ATW) Injector, the Event-Driven Sparse (EDS) Injector, and the Channel Selection Fusion (CSF) module. The approach achieves state-of-the-art accuracy on the DDD17-Seg, DSEC-Semantic, and M3ED-Semantic benchmarks. On DSEC-Semantic, it reduces energy consumption by 65% while retaining state-of-the-art segmentation accuracy.
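As a rough illustration of how a spiking branch turns an input stream into sparse binary activity, the sketch below simulates a single leaky integrate-and-fire (LIF) neuron over discrete time steps. The summary does not state which neuron model the paper uses, so the LIF model, the hard reset, the `threshold` and `decay` values, and the name `lif_forward` are all assumptions made for illustration.

```python
def lif_forward(inputs, threshold=1.0, decay=0.5):
    """Simulate one leaky integrate-and-fire neuron over discrete time steps.

    inputs: sequence of input currents, one per time step.
    Returns a list of 0/1 spikes with the same length as inputs.
    All parameter values are illustrative assumptions, not the paper's settings.
    """
    membrane = 0.0
    spikes = []
    for current in inputs:
        # Leak the previous potential, then integrate the new input.
        membrane = decay * membrane + current
        if membrane >= threshold:
            spikes.append(1)
            membrane = 0.0  # hard reset after emitting a spike
        else:
            spikes.append(0)
    return spikes


# Hypothetical per-step input, e.g. accumulated event counts at one pixel.
event_input = [0.6, 0.8, 0.0, 0.0, 1.2, 0.1]
spikes = lif_forward(event_input)
print(spikes)
```

The neuron emits a spike only when enough input has accumulated, so most time steps produce zeros. That sparsity is the usual argument for why spiking branches can be cheaper to run than dense ANN layers.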
📝 Abstract
Event cameras have recently been introduced into image semantic segmentation, owing to their high temporal resolution and other advantageous properties. However, existing event-based semantic segmentation methods often fail to fully exploit the complementary information provided by frames and events, resulting in complex training strategies and increased computational costs. To address these challenges, we propose an efficient hybrid framework for image semantic segmentation, comprising a Spiking Neural Network (SNN) branch for events and an Artificial Neural Network (ANN) branch for frames. Specifically, we introduce three specialized modules to facilitate the interaction between these two branches: the Adaptive Temporal Weighting (ATW) Injector, the Event-Driven Sparse (EDS) Injector, and the Channel Selection Fusion (CSF) module. The ATW Injector dynamically integrates temporal features from event data into frame features, improving segmentation accuracy by leveraging critical dynamic temporal information. The EDS Injector combines sparse event data with rich frame features, ensuring precise alignment of temporal and spatial information. The CSF module selectively merges these features to optimize segmentation performance. Experimental results demonstrate that our framework not only achieves state-of-the-art accuracy on the DDD17-Seg, DSEC-Semantic, and M3ED-Semantic datasets but also significantly reduces energy consumption, with a 65% reduction on the DSEC-Semantic dataset.
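To make the idea of channel-wise selective merging concrete, the following minimal sketch blends a frame feature map and an event feature map one channel at a time using a sigmoid gate. The abstract does not describe the internals of the CSF module, so this is not the authors' design: the gate computed from the difference of channel means is a hypothetical stand-in for a learned gating layer, and the name `channel_select_fuse` is invented for illustration.

```python
import math


def channel_select_fuse(frame_feats, event_feats):
    """Fuse two feature maps channel by channel with a sigmoid gate.

    Each argument is a list of channels; each channel is a flat list of floats.
    The gate is derived from the difference of the channel means, an assumed
    placeholder for whatever learned selection the real module applies.
    Returns (fused_channels, gates).
    """
    fused = []
    gates = []
    for frame_ch, event_ch in zip(frame_feats, event_feats):
        # Global average pooling over each channel.
        frame_mean = sum(frame_ch) / len(frame_ch)
        event_mean = sum(event_ch) / len(event_ch)
        # Gate near 1 favors the frame channel, near 0 the event channel.
        gate = 1.0 / (1.0 + math.exp(-(frame_mean - event_mean)))
        gates.append(gate)
        # Convex combination of the two modalities for this channel.
        fused.append([gate * f + (1.0 - gate) * e
                      for f, e in zip(frame_ch, event_ch)])
    return fused, gates


# Two channels with two spatial positions each (toy values).
frame = [[2.0, 2.0], [1.0, 1.0]]
event = [[0.0, 0.0], [1.0, 1.0]]
fused, gates = channel_select_fuse(frame, event)
print(gates)
print(fused)
```

Because each output value is a convex combination of the two inputs, the fused features always lie between the frame and event values for that channel. A trained module would learn the gate rather than derive it from channel means.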