AI Summary
This work addresses key challenges in natural gesture recognition for wearable devices (e.g., smart glasses): poor robustness, high power consumption, and weak cross-user/environment generalization. We propose the first ultra-low-power micro-gesture recognition system tailored for event cameras. Our approach introduces a human-centered gesture set comprising thumb swipes and double-pinch micro-gestures, and designs a domain-sampling simulation training framework based solely on synthetic data, enabling zero-shot cross-user and cross-environment generalization without real-world annotation. The system integrates a lightweight temporal neural network with Hexagon DSP hardware acceleration. It achieves 6–8 mW power consumption, with F1 scores exceeding 70% (2-channel) and 80% (6-channel), outperforming state-of-the-art methods by 20% in accuracy while cutting power by roughly 25×. To our knowledge, this is the first event-driven gesture recognition system to achieve high accuracy, ultra-low power, and calibration-free operation simultaneously.
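To make the domain-sampling idea concrete, here is a minimal sketch of what such a synthetic-data loop could look like. Everything in it is an assumption for illustration: the parameter names, their ranges, and the gesture labels are hypothetical, and the paper's actual simulation framework is not shown.

```python
import random

# Illustrative domain-sampling loop for synthetic training data.
# Parameter names, ranges, and labels are hypothetical, not the paper's.
DOMAIN_RANGES = {
    "hand_scale":      (0.8, 1.2),     # anatomical variation across users
    "skin_albedo":     (0.2, 0.9),     # appearance variation
    "camera_yaw_deg":  (-15.0, 15.0),  # mounting tolerance on the glasses
    "illuminance_lux": (50.0, 2000.0), # dim indoor to bright outdoor scenes
    "gesture_speed":   (0.5, 2.0),     # slow to fast gesture execution
}

def sample_domain(rng: random.Random) -> dict:
    """Draw one point in the simulation domain for one synthetic clip."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in DOMAIN_RANGES.items()}

def generate_dataset(n_clips: int, seed: int = 0):
    """Yield (domain_params, gesture_label) pairs; a renderer plus an
    event-camera emulator (not shown) would turn each pair into a
    labeled synthetic event stream."""
    rng = random.Random(seed)
    labels = ["swipe_left", "swipe_right", "double_pinch", "background"]
    for _ in range(n_clips):
        yield sample_domain(rng), rng.choice(labels)
```

Sampling the domain this broadly, rather than matching any one user or scene, is the mechanism by which a model trained purely in simulation can transfer zero-shot to real users and environments.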
Abstract
We present an advance in wearable technology: a mobile-optimized, real-time, ultra-low-power event camera system that enables natural hand gesture control for smart glasses, dramatically improving user experience. While hand gesture recognition in computer vision has advanced significantly, critical challenges remain in creating systems that are intuitive, adaptable across diverse users and environments, and energy-efficient enough for practical wearable applications. Our approach tackles these challenges through carefully selected microgestures: lateral thumb swipes across the index finger (in both directions) and a double pinch between thumb and index fingertips. These human-centered interactions leverage natural hand movements, ensuring intuitive usability without requiring users to learn complex command sequences. To overcome variability in users and environments, we developed a novel simulation methodology that enables comprehensive domain sampling without extensive real-world data collection. Our power-optimized architecture maintains exceptional performance, achieving F1 scores above 80% on benchmark datasets featuring diverse users and environments. The resulting models operate at just 6–8 mW when exploiting the Qualcomm Snapdragon Hexagon DSP, with our 2-channel implementation exceeding a 70% F1 score and our 6-channel model surpassing 80% across all gesture classes in user studies. These results were achieved using only synthetic training data. This improves on the state of the art by 20% in F1 score while reducing power by 25× when using the DSP. This advance brings the deployment of ultra-low-power vision systems in wearable devices closer to reality and opens new possibilities for seamless human-computer interaction.
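One plausible reading of the 2-channel and 6-channel models is the standard event-camera input representation: events accumulated into one count frame per polarity (2 channels), or into several temporal bins per polarity (3 bins × 2 polarities = 6 channels). The sketch below implements that common conversion; the paper's exact representation is not specified here, so treat this as an assumption.

```python
import numpy as np

def events_to_frames(xs, ys, ts, ps, h, w, n_bins=1):
    """Accumulate an event stream into per-polarity count frames.

    xs, ys : integer pixel coordinates of each event
    ts     : monotonically increasing timestamps
    ps     : polarities as integers in {0, 1} (OFF / ON)
    Returns an array of shape (2 * n_bins, h, w):
    2 channels for n_bins=1, 6 channels for n_bins=3.
    """
    frames = np.zeros((2 * n_bins, h, w), dtype=np.float32)
    t0, t1 = ts[0], ts[-1]
    # Assign each event to a temporal bin, then to its polarity channel.
    bins = np.minimum(((ts - t0) / max(t1 - t0, 1e-9) * n_bins).astype(int),
                      n_bins - 1)
    np.add.at(frames, (2 * bins + ps, ys, xs), 1.0)
    return frames

# Example: a hypothetical 6-channel input tensor from one event clip.
# frames = events_to_frames(xs, ys, ts, ps, h=240, w=320, n_bins=3)
```

Under this reading, more temporal bins preserve finer motion cues at the cost of more compute, which would be consistent with the reported accuracy gap between the 2-channel and 6-channel variants.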