Path-adaptive Spatio-Temporal State Space Model for Event-based Recognition with Arbitrary Duration

📅 2024-09-25
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing event-based recognition methods sample and aggregate events at fixed temporal intervals, which makes them ill-suited to event streams of arbitrary duration (e.g., 0.1 s to 4.5 s) and limits generalization across inference frequencies, leaving the high temporal resolution of event data underused. To address this, the authors propose PAST-SSM, built around a Path-Adaptive Event Aggregation and Scan (PEAS) module that encodes events of varying duration into fixed-dimension features by adaptively scanning and selecting aggregated event frames. A Multi-faceted Selection Guiding (MSG) loss reduces randomness and redundancy in the encoded features, and a state space model (SSM), whose complexity is linear in sequence length, learns spatiotemporal relationships from them. The authors also build ArDVS100, a minute-level event-based recognition dataset with arbitrary durations. The method outperforms prior work by 3.45%, 0.38%, and 8.31% on the DVS Action, SeAct, and HARDVS datasets, respectively.

📝 Abstract
Event cameras are bio-inspired sensors that capture intensity changes asynchronously and output event streams with distinct advantages, such as high temporal resolution. To exploit event cameras for object/action recognition, existing methods predominantly sample and aggregate events over second-level durations at a fixed temporal interval (or frequency). However, they often struggle to capture the spatiotemporal relationships in longer (e.g., minute-level) event streams and to generalize across varying temporal frequencies. To fill the gap, we present a novel framework, dubbed PAST-SSM, exhibiting superior capacity in recognizing events with arbitrary duration (e.g., 0.1s to 4.5s) and generalizing to varying inference frequencies. Our key insight is to learn the spatiotemporal relationships from the encoded event features via the state space model (SSM), whose linear complexity makes it well suited to modeling high-temporal-resolution events with longer sequences. To achieve this goal, we first propose a Path-Adaptive Event Aggregation and Scan (PEAS) module to encode events of varying duration into features with fixed dimensions by adaptively scanning and selecting aggregated event frames. On top of PEAS, we introduce a novel Multi-faceted Selection Guiding (MSG) loss to minimize the randomness and redundancy of the encoded features. This enhances the model's generalization across different inference frequencies. Lastly, the SSM is employed to better learn the spatiotemporal properties from the encoded features. Moreover, we build a minute-level event-based recognition dataset, named ArDVS100, with arbitrary duration for the benefit of the community. Extensive experiments show that our method outperforms prior work by +3.45%, +0.38% and +8.31% on the DVS Action, SeAct and HARDVS datasets, respectively.
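To make the "varying duration in, fixed dimensions out" idea concrete, here is a minimal sketch in NumPy. It is not the authors' PEAS module: the function names `aggregate_events` and `select_fixed`, the `(x, y, t, polarity)` event layout, and the per-polarity count frames are all assumptions made for illustration, and the uniform frame selection is only a stand-in for the learned, path-adaptive selection the abstract describes.

```python
import numpy as np


def aggregate_events(events, height, width, frequency_hz):
    """Bin raw events into count frames at a chosen aggregation frequency.

    events: float array of shape (N, 4) with columns (x, y, t_seconds, polarity),
            polarity in {0, 1}. This layout is an assumption of the sketch.
    Returns an array of shape (T, 2, height, width); T grows with stream duration.
    """
    t = events[:, 2]
    t0 = t.min()
    duration = t.max() - t0
    num_frames = max(1, int(np.ceil(duration * frequency_hz)))

    # Frame index of each event; the last timestamp is clamped into the final bin.
    frame_idx = np.minimum(((t - t0) * frequency_hz).astype(int), num_frames - 1)
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    p = events[:, 3].astype(int)

    frames = np.zeros((num_frames, 2, height, width), dtype=np.float32)
    # np.add.at accumulates correctly when several events hit the same pixel.
    np.add.at(frames, (frame_idx, p, y, x), 1.0)
    return frames


def select_fixed(frames, k):
    """Pick exactly k frames, evenly spaced in time.

    Placeholder for a learned selection: short streams repeat frames,
    long streams skip frames, so the output length is always k.
    """
    positions = np.linspace(0, len(frames) - 1, k).round().astype(int)
    return frames[positions]
```

Whatever the stream duration or aggregation frequency, `select_fixed(aggregate_events(...), k)` returns `k` frames, which is the property a downstream sequence model needs. In the paper the selection is learned and regularized by the MSG loss rather than fixed in advance.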
Problem

Research questions and friction points this paper is trying to address.

Existing recognition methods handle only short, second-level event streams
Poor generalization across varying inference frequencies
Underutilization of event cameras' high temporal resolution capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

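Learns spatiotemporal relationships from encoded event features with a linear-complexity state space model (SSM)
Uses a Path-Adaptive Event Aggregation and Scan (PEAS) module to encode events of varying duration into fixed-dimension features
Introduces a Multi-faceted Selection Guiding (MSG) loss to reduce randomness and redundancy in the encoded features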
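The linear-complexity claim for state space models can be illustrated with the textbook discrete recurrence h_t = A h_{t-1} + B u_t, y_t = C h_t. The sketch below is a plain time-invariant linear SSM written in NumPy, not the specific SSM variant used in PAST-SSM (which this page does not describe); the function name `ssm_scan` and the matrix shapes are assumptions of the example.

```python
import numpy as np


def ssm_scan(A, B, C, u):
    """Run a discrete linear state space model over a sequence.

    A: (n, n) state matrix, B: (n, d_in) input matrix, C: (d_out, n) output matrix.
    u: (L, d_in) input sequence, e.g. one feature vector per encoded event frame.
    Returns outputs of shape (L, d_out).
    """
    h = np.zeros(A.shape[0])
    outputs = []
    for u_t in u:                # one constant-cost update per step: O(L) overall
        h = A @ h + B @ u_t      # state update
        outputs.append(C @ h)    # readout
    return np.stack(outputs)
```

Each step costs the same regardless of how many steps came before, so total work grows linearly with sequence length L, in contrast to the quadratic cost of full self-attention. That is what makes this family of models attractive for long, finely sampled event sequences.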