Efficient Event-Based Semantic Segmentation via Exploiting Frame-Event Fusion: A Hybrid Neural Network Approach

📅 2025-07-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing event-based semantic segmentation methods struggle to combine the spatial richness of frame images with the high temporal resolution of event streams, which leads to complex training strategies and substantial computational overhead. This paper proposes an efficient hybrid neural network framework that processes event streams with a spiking neural network (SNN) branch and frame images with an artificial neural network (ANN) branch. To let the two branches interact, it introduces three modules: adaptive temporal weighting, event-driven sparse feature injection, and channel-wise selective fusion. The approach achieves state-of-the-art accuracy on the DDD17-Seg, DSEC-Semantic, and M3ED-Semantic benchmarks. On DSEC-Semantic, it also reduces energy consumption by 65% while keeping that state-of-the-art segmentation accuracy.

📝 Abstract

Event cameras have recently been introduced into image semantic segmentation, owing to their high temporal resolution and other advantageous properties. However, existing event-based semantic segmentation methods often fail to fully exploit the complementary information provided by frames and events, resulting in complex training strategies and increased computational costs. To address these challenges, we propose an efficient hybrid framework for image semantic segmentation, comprising a Spiking Neural Network branch for events and an Artificial Neural Network branch for frames. Specifically, we introduce three specialized modules to facilitate the interaction between these two branches: the Adaptive Temporal Weighting (ATW) Injector, the Event-Driven Sparse (EDS) Injector, and the Channel Selection Fusion (CSF) module. The ATW Injector dynamically integrates temporal features from event data into frame features, enhancing segmentation accuracy by leveraging critical dynamic temporal information. The EDS Injector effectively combines sparse event data with rich frame features, ensuring precise temporal and spatial information alignment. The CSF module selectively merges these features to optimize segmentation performance. Experimental results demonstrate that our framework not only achieves state-of-the-art accuracy across the DDD17-Seg, DSEC-Semantic, and M3ED-Semantic datasets but also significantly reduces energy consumption, achieving a 65% reduction on the DSEC-Semantic dataset.
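To make the spiking branch and the idea of temporal weighting more concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the leaky integrate-and-fire (LIF) neuron is a standard SNN building block, but the function names, the `tau` and `v_threshold` values, and the scoring rule (softmax over per-step mean firing rate) are all hypothetical choices made for illustration. In the paper the ATW Injector is a learned module whose exact form is not given in this abstract.

```python
import numpy as np


def lif_forward(currents, tau=2.0, v_threshold=1.0):
    """Run a leaky integrate-and-fire layer over T time steps.

    currents: float array of shape (T, C, H, W), one input map per time step.
    Returns binary spikes with the same shape.
    tau and v_threshold are illustrative defaults, not values from the paper.
    """
    v = np.zeros_like(currents[0])           # membrane potential
    spikes = np.zeros_like(currents)
    for t in range(currents.shape[0]):
        v = v + (currents[t] - v) / tau      # leaky integration toward the input
        fired = v >= v_threshold
        spikes[t] = fired.astype(currents.dtype)
        v = np.where(fired, 0.0, v)          # hard reset after a spike
    return spikes


def adaptive_temporal_weights(spikes):
    """Return one weight per time step (softmax over mean firing rate).

    Hypothetical scoring rule: steps with more activity get more weight.
    """
    scores = spikes.mean(axis=(1, 2, 3))     # shape (T,)
    scores = scores - scores.max()           # numerical stability
    w = np.exp(scores)
    return w / w.sum()


def inject_temporal(frame_feat, spikes):
    """Add the time-weighted spike features to a frame feature map (C, H, W)."""
    w = adaptive_temporal_weights(spikes)
    event_feat = np.einsum("t,tchw->chw", w, spikes)
    return frame_feat + event_feat


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    event_currents = rng.uniform(0.0, 3.0, size=(4, 8, 16, 16))
    frame_features = rng.normal(size=(8, 16, 16))
    spikes = lif_forward(event_currents)
    fused = inject_temporal(frame_features, spikes)
    print(fused.shape, adaptive_temporal_weights(spikes))
```

The weighting collapses the time axis of the event features so they can be added to a single frame feature map; a learned module would replace the fixed firing-rate score used here.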

Problem

Research questions and friction points this paper is trying to address.

Exploiting frame-event fusion for semantic segmentation

Reducing computational costs in event-based segmentation

Enhancing accuracy with hybrid neural network approach

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid Spiking and Artificial Neural Network framework

Adaptive Temporal Weighting Injector for dynamic features

Channel Selection Fusion optimizes feature integration
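As a rough illustration of the fusion ideas listed above, the sketch below shows sparse event-driven injection (event features are added only where events occurred) and a per-channel gate that blends frame and event features. Everything here is an assumption for illustration: the function names, the boolean `event_mask`, and the fixed sigmoid gate built from pooled channel statistics are hypothetical, and the paper's EDS Injector and CSF module are learned components whose internals are not described on this page.

```python
import numpy as np


def sparse_inject(frame_feat, event_feat, event_mask):
    """Add event features to frame features only at pixels with events.

    frame_feat, event_feat: arrays of shape (C, H, W).
    event_mask: boolean array of shape (H, W), True where events occurred.
    """
    return frame_feat + event_feat * event_mask[None, :, :]


def channel_selection_fusion(frame_feat, event_feat):
    """Blend two (C, H, W) feature maps with one gate value per channel.

    Hypothetical gate: sigmoid of the difference between the globally
    average-pooled frame and event descriptors. A trained model would
    learn this mapping instead.
    """
    desc = frame_feat.mean(axis=(1, 2)) - event_feat.mean(axis=(1, 2))
    gate = 1.0 / (1.0 + np.exp(-desc))       # shape (C,), values in (0, 1)
    gate = gate[:, None, None]               # broadcast over H and W
    return gate * frame_feat + (1.0 - gate) * event_feat


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.normal(size=(8, 16, 16))
    event = rng.normal(size=(8, 16, 16))
    mask = rng.random((16, 16)) < 0.1        # roughly 10% of pixels have events
    injected = sparse_inject(frame, event, mask)
    fused = channel_selection_fusion(injected, event)
    print(fused.shape)
```

Because the gate is a convex weight per channel, each fused value lies between the corresponding frame and event values, so the fusion selects between modalities rather than amplifying either one.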

🔎 Similar Papers

Deep Common Feature Mining for Efficient Video Semantic Segmentation