🤖 AI Summary
Event cameras produce asynchronous, sparse, high-temporal-resolution data, and conventional asynchronous-to-synchronous (A2S) representation methods handle it with fundamental limitations: weak expressivity, poor generalization, and constrained real-time performance. To address these challenges, we propose EVA, an end-to-end asynchronous representation learning framework that, for the first time, brings linear attention mechanisms and self-supervised language modeling into event-based learning. EVA employs a streaming encoder operating at the event level, enabling direct “event → vector” mapping without explicit synchronization preprocessing. This design simultaneously achieves high representational capacity, strong generalization, and low inference latency. On the DVS128-Gesture and N-Cars classification benchmarks, EVA outperforms existing A2S approaches. Moreover, on the Gen1 object detection task, it achieves 47.7 mAP, marking the first substantive breakthrough of the A2S paradigm in a challenging, high-difficulty detection scenario.
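The summary's details of EVA's encoder are not spelled out here, but the core idea of an event-level streaming encoder with linear attention can be illustrated with a minimal sketch. The code below is a hypothetical toy, not EVA's actual architecture: the embedding matrices, feature map, and dimensions are all illustrative assumptions. It shows why linear attention suits the "event → vector" setting: the attention state is a running sum that is updated in O(1) per event, so each incoming event yields a representation without buffering or synchronizing the stream.

```python
import numpy as np

# Hypothetical sketch (NOT the paper's architecture): each event (x, y, t, p)
# is embedded to a feature vector, and linear attention is computed with a
# running state so the cost per event is constant.

rng = np.random.default_rng(0)
D = 16  # feature dimension (illustrative)

# Random projections standing in for learned embedding and q/k/v maps.
W_emb = rng.standard_normal((4, D)) * 0.1
W_q, W_k, W_v = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))

def feature_map(u):
    # elu(u) + 1: keeps keys/queries positive, a common linear-attention choice
    return np.where(u > 0, u + 1.0, np.exp(u))

class StreamingLinearAttention:
    """Maintains S = sum phi(k) v^T and z = sum phi(k), updated event by event."""
    def __init__(self, dim):
        self.S = np.zeros((dim, dim))
        self.z = np.zeros(dim)

    def step(self, event):
        x = np.asarray(event, dtype=float) @ W_emb   # event -> vector embedding
        q = feature_map(x @ W_q)
        k = feature_map(x @ W_k)
        v = x @ W_v
        self.S += np.outer(k, v)                     # O(D^2) state update
        self.z += k
        return (q @ self.S) / (q @ self.z + 1e-6)    # representation for this event

enc = StreamingLinearAttention(D)
for ev in [(12, 40, 0.001, 1), (13, 40, 0.002, -1), (12, 41, 0.003, 1)]:
    rep = enc.step(ev)  # one D-dimensional representation per incoming event
print(rep.shape)  # → (16,)
```

Because the state (S, z) summarizes the whole history, downstream heads can read out a representation at any event's timestamp, which is what allows an A2S pipeline to stay asynchronous while feeding standard tensor-based models.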
📝 Abstract
Event cameras deliver visual data with high temporal resolution, low latency, and minimal redundancy, yet their asynchronous, sparse sequential nature challenges standard tensor-based machine learning (ML). While the recent asynchronous-to-synchronous (A2S) paradigm aims to bridge this gap by asynchronously encoding events into learned representations for ML pipelines, existing A2S approaches often sacrifice representation expressivity and generalizability compared to dense, synchronous methods. This paper introduces EVA (EVent Asynchronous representation learning), a novel A2S framework that generates highly expressive and generalizable event-by-event representations. Inspired by the analogy between events and language, EVA uniquely adapts advances from language modeling, namely linear attention and self-supervised learning, for its construction. In experiments, EVA outperforms prior A2S methods on recognition tasks (DVS128-Gesture and N-Cars), and is the first A2S framework to successfully master demanding detection tasks, achieving a remarkable 47.7 mAP on the Gen1 dataset. These results underscore EVA's transformative potential for advancing real-time event-based vision applications.