OmniEvent: Unified Event Representation Learning

📅 2025-08-03
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
Event camera data exhibit spatiotemporal irregularity and structural incompleteness, which forces existing models to rely on task-specific architectures and leads to poor generalization. To address this, we propose OmniEvent, the first unified event representation learning framework, built on a "decouple-enhance-fuse" paradigm that transforms raw, heterogeneous event streams into regular grid tensors. First, local features are aggregated and enhanced independently in the spatial and temporal domains; second, space-filling curve encoding is employed to enlarge receptive fields and improve computational efficiency; third, attention mechanisms enable adaptive spatiotemporal feature fusion. The resulting representation is directly compatible with standard vision models, eliminating the need for task-specific backbones. Evaluated across three major event-based vision tasks and ten benchmark datasets, OmniEvent consistently surpasses task-specialized state-of-the-art methods, achieving up to a 68.2% performance gain. This work significantly advances the generalization and practical deployment of event-based vision systems.
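For intuition, below is a minimal PyTorch sketch of what a decouple-enhance-fuse style encoder could look like. The class name `DecoupleEnhanceFuse`, the MLP/attention layer choices, and the mean-pooling rasterization are all hypothetical illustrations, not the paper's implementation; the point is only the shape of the pipeline: spatial and temporal features are extracted independently, fused by attention, and scattered onto a regular grid.

```python
import torch
import torch.nn as nn


class DecoupleEnhanceFuse(nn.Module):
    """Hypothetical decouple-enhance-fuse encoder (illustration only).

    Takes raw events as an (N, 4) tensor of (x, y, t, polarity) and
    returns a regular (C, H, W) grid tensor that a standard vision
    backbone can consume.
    """

    def __init__(self, dim=32, height=128, width=128):
        super().__init__()
        self.h, self.w = height, width
        # Decouple + enhance: independent per-domain feature extractors.
        self.spatial_mlp = nn.Sequential(
            nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.temporal_mlp = nn.Sequential(
            nn.Linear(2, dim), nn.ReLU(), nn.Linear(dim, dim))
        # Fuse: attention over the two per-event domain tokens.
        self.fuse = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, events):
        # events: (N, 4) float tensor, columns (x, y, t, p); t in [0, 1].
        x, y, t, p = events.unbind(dim=1)
        spatial = self.spatial_mlp(
            torch.stack([x / self.w, y / self.h, p], dim=1))
        temporal = self.temporal_mlp(torch.stack([t, p], dim=1))
        tokens = torch.stack([spatial, temporal], dim=1)  # (N, 2, dim)
        fused, _ = self.fuse(tokens, tokens, tokens)      # S-T interaction
        feat = fused.mean(dim=1)                          # (N, dim)
        # Rasterize: mean-pool per-event features into their pixel cells.
        xi = x.long().clamp(0, self.w - 1)
        yi = y.long().clamp(0, self.h - 1)
        idx = yi * self.w + xi                            # flat pixel index
        grid = torch.zeros(feat.shape[1], self.h * self.w)
        grid.index_add_(1, idx, feat.t())
        count = torch.zeros(self.h * self.w).index_add_(
            0, idx, torch.ones_like(idx, dtype=torch.float))
        return (grid / count.clamp(min=1)).view(-1, self.h, self.w)


# Usage: 1000 random events -> a (32, 128, 128) grid tensor.
events = torch.rand(1000, 4) * torch.tensor([128.0, 128.0, 1.0, 1.0])
grid = DecoupleEnhanceFuse()(events)
```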

📝 Abstract
Event cameras have gained increasing popularity in computer vision due to their ultra-high dynamic range and temporal resolution. However, event networks rely heavily on task-specific designs due to the unstructured data distribution and spatial-temporal (S-T) inhomogeneity, making it hard to reuse existing architectures for new tasks. We propose OmniEvent, the first unified event representation learning framework that achieves SOTA performance across diverse tasks, fully removing the need for task-specific designs. Unlike previous methods that treat event data as 3D point clouds with manually tuned S-T scaling weights, OmniEvent proposes a decouple-enhance-fuse paradigm, where local feature aggregation and enhancement are done independently in the spatial and temporal domains to avoid inhomogeneity issues. Space-filling curves are applied to enable large receptive fields while improving memory and compute efficiency. The features from the individual domains are then fused by attention to learn S-T interactions. The output of OmniEvent is a grid-shaped tensor, which enables standard vision models to process event data without architecture changes. With a unified framework and similar hyper-parameters, OmniEvent outperforms task-specific SOTA methods by up to 68.2% across 3 representative tasks and 10 datasets (Fig. 1). Code will be available at https://github.com/Wickyan/OmniEvent.
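The abstract's space-filling-curve step can be made concrete with a small, self-contained example. A Morton (Z-order) code is used here as one common space-filling curve; the paper does not specify which curve it adopts, so treat this purely as an illustration of how sorting events along such a curve turns a 2D spatial neighborhood into a contiguous 1D span that cheap sequence operators can cover.

```python
import numpy as np


def morton_code_2d(x, y, bits=10):
    """Interleave the bits of integer pixel coordinates into a Morton
    (Z-order) code. Sorting events by this code places spatially nearby
    events next to each other in a 1D sequence, so 1D operations
    (windowed attention, 1D convolution) see a large 2D receptive field
    at low memory cost."""
    code = np.zeros_like(x, dtype=np.uint64)
    for b in range(bits):
        code |= ((x >> np.uint64(b)) & np.uint64(1)) << np.uint64(2 * b)
        code |= ((y >> np.uint64(b)) & np.uint64(1)) << np.uint64(2 * b + 1)
    return code


# Toy event stream: columns (x, y, t, polarity) on a 1024 x 1024 sensor.
rng = np.random.default_rng(0)
n = 5000
events = np.stack([
    rng.integers(0, 1024, n).astype(np.float64),  # x
    rng.integers(0, 1024, n).astype(np.float64),  # y
    np.sort(rng.random(n)),                       # normalized timestamps
    rng.integers(0, 2, n).astype(np.float64),     # polarity
], axis=1)

# Reorder the stream along the space-filling curve.
order = np.argsort(morton_code_2d(events[:, 0].astype(np.uint64),
                                  events[:, 1].astype(np.uint64)))
events_zordered = events[order]  # spatially coherent 1D sequence
```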
Problem

Research questions and friction points this paper is trying to address.

Unified event representation learning for diverse tasks
Eliminating task-specific designs in event networks
Handling spatial-temporal inhomogeneity in event data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouple-enhance-fuse paradigm for event data
Space-filling curves for efficient large receptive fields
Grid-shaped tensor output for standard vision models (see the sketch below)
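To illustrate the third point, here is a minimal sketch of the claimed plug-and-play compatibility: if the learned representation is a grid tensor laid out like an image, a stock backbone consumes it with no architecture change. The 3-channel 224 x 224 shape below is an assumption for illustration; the paper does not pin down the output channel count.

```python
import torch
from torchvision.models import resnet18

# Stand-in for an OmniEvent-style output: a grid tensor with the same
# layout as an RGB image (shape is an assumption, not from the paper).
grid = torch.randn(1, 3, 224, 224)
model = resnet18(num_classes=10)   # any stock torchvision backbone
logits = model(grid)               # (1, 10), an ordinary image pipeline
```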
👥 Authors

Weiqi Yan
Fujian Key Lab of Sensing and Computing for Smart Cities, School of Informatics, Xiamen University (XMU), China

Chenlu Lin
Fujian Key Lab of Sensing and Computing for Smart Cities, School of Informatics, Xiamen University (XMU), China

Youbiao Wang
Fujian Key Lab of Sensing and Computing for Smart Cities, School of Informatics, Xiamen University (XMU), China

Zhipeng Cai
Meta AI

Xiuhong Lin
Fujian Key Lab of Sensing and Computing for Smart Cities, School of Informatics, Xiamen University (XMU), China

Yangyang Shi
Meta
natural language processing, language modeling, speech recognition

Weiquan Liu
College of Computer Engineering, Jimei University, Xiamen, China

Yu Zang
Fujian Key Lab of Sensing and Computing for Smart Cities, School of Informatics, Xiamen University (XMU), China