🤖 AI Summary
Existing asynchronous event-based object detection methods struggle to balance training efficiency over long sequences with per-event inference latency, and global state updates force a trade-off between accuracy and efficiency. This work proposes Spatially-Sparse Linear Attention (SSLA), which introduces state-level sparsity into the linear attention mechanism for the first time. By combining a mixture-of-spaces state decomposition with a "scatter-compute-gather" training pipeline, SSLA enables sparse state activation while preserving training parallelism, yielding an end-to-end asynchronous detector named SSLA-Det. The method achieves state-of-the-art results among asynchronous methods, with mAP scores of 0.375 on Gen1 and 0.515 on N-Caltech101, while reducing per-event computational cost by over 20× compared to the strongest baseline.
📝 Abstract
Event cameras provide sequential visual data with spatial sparsity and high temporal resolution, making them attractive for low-latency object detection. Existing asynchronous event-based neural networks realize this low-latency advantage by updating predictions event-by-event, but still suffer from two bottlenecks: recurrent architectures are difficult to train efficiently on long sequences, and improving accuracy often increases per-event computation and latency. Linear attention is appealing in this setting because it supports parallel training and recurrent inference. However, standard linear attention updates a global state for every event, yielding a poor accuracy-efficiency trade-off, which is problematic for object detection, where fine-grained representations, and therefore fine-grained states, are preferred. The key challenge is therefore to introduce sparse state activation that exploits event sparsity while preserving efficient parallel training. We propose Spatially-Sparse Linear Attention (SSLA), which introduces a mixture-of-spaces state decomposition and a scatter-compute-gather training procedure, enabling state-level sparsity as well as training parallelism. Built on SSLA, we develop an end-to-end asynchronous linear attention model, SSLA-Det, for event-based object detection. On Gen1 and N-Caltech101, SSLA-Det achieves state-of-the-art accuracy among asynchronous methods, reaching 0.375 mAP and 0.515 mAP, respectively, while reducing per-event computation by more than 20 times compared to the strongest prior asynchronous baseline, demonstrating the potential of linear attention for low-latency event-based vision.
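The abstract's premise, that linear attention admits both a parallel training form and an equivalent recurrent per-event form, and that the standard recurrence touches the entire global state on every update, can be illustrated with a minimal NumPy sketch. This is not the paper's SSLA implementation; the function names, the unnormalized attention, and the per-cell state layout in `spatially_sparse_update` are illustrative assumptions meant only to show why the global update is costly and where state-level sparsity could cut that cost.

```python
import numpy as np

def recurrent_linear_attention(q, k, v):
    """Event-by-event form: a running state S = sum_{s<=t} k_s v_s^T.

    Each new event costs O(d * d_v) regardless of sequence length,
    which is what makes linear attention attractive for asynchronous
    inference. Note the whole state S is rewritten on EVERY event --
    the bottleneck the abstract attributes to standard linear attention.
    """
    T, d = q.shape
    d_v = v.shape[1]
    S = np.zeros((d, d_v))
    out = np.zeros((T, d_v))
    for t in range(T):
        S = S + np.outer(k[t], v[t])   # global state update
        out[t] = q[t] @ S              # read-out for this event
    return out

def parallel_linear_attention(q, k, v):
    """Equivalent training form: one causally masked matrix product,
    so all T positions are computed in parallel."""
    scores = q @ k.T                        # (T, T) pairwise q.k
    causal = np.tril(np.ones_like(scores))  # zero out future positions
    return (scores * causal) @ v

def spatially_sparse_update(state_blocks, cell_id, k_t, v_t):
    """Hypothetical sketch of state-level sparsity: the state is split
    into per-cell blocks and an event touches only the block of the
    spatial cell it falls in, so update cost scales with the block
    size rather than the full state. (Illustrative only; SSLA's actual
    mixture-of-spaces decomposition is more involved.)"""
    state_blocks[cell_id] = state_blocks[cell_id] + np.outer(k_t, v_t)
    return state_blocks
```

A quick equivalence check of the two dense forms: feeding the same `q`, `k`, `v` to `recurrent_linear_attention` and `parallel_linear_attention` yields the same outputs, which is the property that lets such models train in parallel and then infer recurrently.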