EgoEvGesture: Gesture Recognition Based on Egocentric Event Camera

📅 2025-03-16

🤖 AI Summary

This paper targets first-person gesture recognition with event cameras in dynamic scenarios, where head-motion interference, the difficulty of modeling sparse asynchronous events, and limited illumination robustness all hurt accuracy. It proposes a lightweight network designed for event streams, built around two modules: (1) a State-Space Context Module that decouples head-motion noise from genuine hand dynamics, and (2) a parameter-free Bins-Temporal Shift Module (BSTM) for efficient sparse event fusion. It also contributes EgoEvGesture, the first large-scale, first-person event-based gesture dataset. The network employs asymmetric depthwise-separable CNNs and an asynchronous feature encoding mechanism. With only 7M parameters it achieves 62.7% accuracy on heterogeneous test sets (3.1% above the prior state of the art) and 96.97% on DVS128 Gesture, indicating strong cross-dataset generalization. The dataset and models are publicly released.

📝 Abstract

Egocentric gesture recognition is a pivotal technology for enhancing natural human-computer interaction, yet traditional RGB-based solutions suffer from motion blur and illumination variations in dynamic scenarios. While event cameras offer distinct advantages in handling high dynamic range with ultra-low power consumption, existing RGB-based architectures are inherently limited in processing asynchronous event streams because of their synchronous, frame-based nature. Moreover, from an egocentric perspective, event cameras record events generated by both head movements and hand gestures, which increases the complexity of gesture recognition. To address this, we propose a novel network architecture specifically designed for event data processing, incorporating (1) a lightweight CNN with asymmetric depthwise convolutions to reduce parameters while preserving spatiotemporal features, (2) a plug-and-play state-space model used as a context block that decouples head-movement noise from gesture dynamics, and (3) a parameter-free Bins-Temporal Shift Module (BSTM) that shifts features along the bin and temporal dimensions to fuse sparse events efficiently. We further build the EgoEvGesture dataset, the first large-scale dataset for egocentric gesture recognition using event cameras. Experimental results demonstrate that our method achieves 62.7% accuracy in heterogeneous testing with only 7M parameters, 3.1% higher than state-of-the-art approaches. Notable misclassifications in freestyle motions stem from high inter-personal variability and from unseen test patterns that differ from the training data. Moreover, our approach achieves 96.97% accuracy on DVS128 Gesture, demonstrating strong cross-dataset generalization capability. The dataset and models are made publicly available at https://github.com/3190105222/EgoEv_Gesture.
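To give a rough sense of why asymmetric depthwise convolutions reduce parameters, the sketch below counts the weights of a single layer. The abstract does not describe the exact layer layout, so the factorization used here (a 1×k depthwise convolution, then a k×1 depthwise convolution, then a 1×1 pointwise convolution) and the channel sizes are assumptions chosen for illustration. The function names are hypothetical.

```python
def conv_params(c_in, c_out, kh, kw, groups=1):
    """Weight count of a bias-free 2D convolution with grouped channels."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * kh * kw


def standard_block(c_in, c_out, k=3):
    """A single dense k x k convolution."""
    return conv_params(c_in, c_out, k, k)


def asymmetric_depthwise_separable_block(c_in, c_out, k=3):
    """Assumed layout (not taken from the paper):
    1 x k depthwise -> k x 1 depthwise -> 1 x 1 pointwise."""
    dw_row = conv_params(c_in, c_in, 1, k, groups=c_in)  # one 1 x k filter per channel
    dw_col = conv_params(c_in, c_in, k, 1, groups=c_in)  # one k x 1 filter per channel
    pointwise = conv_params(c_in, c_out, 1, 1)           # mixes channels
    return dw_row + dw_col + pointwise


dense = standard_block(64, 128)
light = asymmetric_depthwise_separable_block(64, 128)
print(dense, light)
```

Under these assumed sizes (64 input channels, 128 output channels, k = 3), the dense layer has 73,728 weights and the factorized one has 8,576, roughly an 8.6× reduction for that layer.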

Problem

Research questions and friction points this paper is trying to address.

Enhance gesture recognition in dynamic, egocentric scenarios.

Use event cameras to overcome the limitations of RGB-based methods.

Separate head movement noise from gesture dynamics effectively.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight CNN with asymmetric depthwise convolutions

Plug-and-play state-space model for noise decoupling

Parameter-free Bins-Temporal Shift Module (BSTM)
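The idea of a parameter-free shift along the bin and temporal dimensions can be sketched in a few lines. The abstract only states that BSTM shifts features along these two dimensions without learned weights. The channel split, the shift directions, and the zero filling below are assumptions borrowed from the general style of temporal shift modules, not the paper's actual design. Features are indexed as `x[t][b][c]` (time step, event bin, channel), with spatial dimensions omitted for brevity.

```python
def shift_along_first_axis(seq, start, fold):
    """Shift selected channels of seq[n][c] along n, zero-filling the gaps.

    Channels [start, start + fold) read from the previous index,
    channels [start + fold, start + 2 * fold) read from the next index,
    all other channels are copied unchanged.
    """
    n, c = len(seq), len(seq[0])
    out = [[0.0] * c for _ in range(n)]
    for i in range(n):
        for ch in range(c):
            if start <= ch < start + fold:
                src = i - 1
            elif start + fold <= ch < start + 2 * fold:
                src = i + 1
            else:
                src = i
            if 0 <= src < n:
                out[i][ch] = seq[src][ch]
    return out


def bins_temporal_shift(x, fold):
    """Hypothetical bins-temporal shift on x[t][b][c]; no learned weights.

    The first 2 * fold channels are shifted along time, the next
    2 * fold channels along bins, and the rest pass through.
    """
    num_t, num_b, num_c = len(x), len(x[0]), len(x[0][0])
    assert 4 * fold <= num_c, "need at least 4 * fold channels"
    shifted_t = [[None] * num_b for _ in range(num_t)]
    for b in range(num_b):
        column = shift_along_first_axis([x[t][b] for t in range(num_t)], 0, fold)
        for t in range(num_t):
            shifted_t[t][b] = column[t]
    return [shift_along_first_axis(shifted_t[t], 2 * fold, fold) for t in range(num_t)]


# Toy input: 3 time steps, 2 bins, 5 channels, each value encodes its own index.
x = [[[1000 + 100 * t + 10 * b + c for c in range(5)] for b in range(2)] for t in range(3)]
y = bins_temporal_shift(x, fold=1)
print(y[1][1])
```

In this toy run, channel 0 at `(t=1, b=1)` now holds the value from the previous time step, channel 1 the value from the next time step, channel 2 the value from the previous bin, channel 3 is zero because there is no next bin, and channel 4 is untouched. Mixing neighbouring bins and time steps this way costs only memory movement, which is the sense in which such a module is parameter-free.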
