📝 Abstract
We present an overview of the Spatio-temporal Instance Segmentation (SIS) challenge held in conjunction with the CVPR 2025 Event-based Vision Workshop. The task is to predict accurate pixel-level segmentation masks of defined object classes from spatio-temporally aligned event camera and grayscale camera data. We provide an overview of the task, dataset, challenge details and results. Furthermore, we describe the methods used by the top-5 ranking teams in the challenge. More resources and code of the participants' methods are available here: https://github.com/tub-rip/MouseSIS/blob/main/docs/challenge_results.md
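The challenge inputs include raw event streams, which most learning-based pipelines first convert into a dense tensor before feeding them to a segmentation network. As background only (the report above does not prescribe any particular representation, and participants' methods may differ), a common choice is the spatio-temporal voxel grid, sketched below with a hypothetical `events_to_voxel_grid` helper:

```python
import numpy as np

def events_to_voxel_grid(events, num_bins, height, width):
    """Accumulate an event stream into a spatio-temporal voxel grid
    with bilinear interpolation along the time axis.

    events: (N, 4) array with columns [t, x, y, p], polarity p in {-1, +1},
            sorted by timestamp t. (Illustrative format, not the
            challenge's actual data layout.)
    Returns: (num_bins, height, width) float32 array.
    """
    voxel = np.zeros((num_bins, height, width), dtype=np.float32)
    t = events[:, 0]
    x = events[:, 1].astype(int)
    y = events[:, 2].astype(int)
    p = events[:, 3]
    # Normalize timestamps to [0, num_bins - 1].
    t_norm = (num_bins - 1) * (t - t[0]) / max(t[-1] - t[0], 1e-9)
    lower = np.floor(t_norm).astype(int)
    upper = np.minimum(lower + 1, num_bins - 1)
    frac = t_norm - lower
    # Split each event's polarity between its two adjacent time bins;
    # np.add.at accumulates correctly even with repeated pixel indices.
    np.add.at(voxel, (lower, y, x), p * (1.0 - frac))
    np.add.at(voxel, (upper, y, x), p * frac)
    return voxel
```

Because the two interpolation weights sum to one, each event contributes exactly its polarity to the grid, so the tensor preserves the event count and polarity balance while giving a fixed-size input that can be fused with the aligned grayscale frames.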