🤖 AI Summary
This work addresses the high power consumption, latency, and privacy concerns of vision-based gesture recognition on resource-constrained devices such as smart glasses by proposing a lightweight multimodal fusion approach tailored for microcontrollers (MCUs). It pairs an 8×8 low-resolution time-of-flight sensor (VL53L8CH) with an 8×8 infrared thermal array (AMG8833) and fuses the two modalities with a grouped-convolution architecture of only 6,343 parameters. Evaluated on a seven-class static gesture dataset, the system achieves 92.3% accuracy and a macro F1-score of 0.93. Deployment on STM32 MCUs demonstrates millisecond-level inference latency at a total system power of 50 mW, balancing efficiency, accuracy, and on-device privacy preservation.
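Since the summary reports both accuracy and a macro F1-score, a minimal sketch of how macro F1 is computed from a confusion matrix may help: the per-class F1 scores are averaged with equal weight, so a rare gesture counts as much as a common one. The function name `macro_f1` and the 3-class matrix below are illustrative assumptions, not the paper's code or data.

```python
def macro_f1(confusion):
    """Macro F1 from a square confusion matrix (rows = true class, columns = predicted class)."""
    n = len(confusion)
    scores = []
    for k in range(n):
        tp = confusion[k][k]
        fp = sum(confusion[r][k] for r in range(n)) - tp  # predicted k, true class differs
        fn = sum(confusion[k]) - tp                       # true k, predicted class differs
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n  # unweighted mean over classes


# Invented toy matrix for illustration only (not results from the paper)
toy = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 1, 9],
]
print(round(macro_f1(toy), 3))
```

For the seven-class task described here, the same function would take a 7×7 matrix accumulated over the cross-validation folds.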
📝 Abstract
Gesture recognition is a cornerstone of Human-Computer Interaction (HCI) for smart eyewear, enabling natural and device-free control in augmented reality environments. Traditional vision-based approaches face significant challenges regarding power consumption, computational latency, and user privacy. This paper proposes a lightweight, privacy-preserving gesture recognition system based on the fusion of low-resolution Time-of-Flight (ToF) and Infrared (IR) thermal sensors. We use an 8×8 multizone ToF sensor (VL53L8CH) and an 8×8 IR array (AMG8833) to capture complementary depth and thermal cues. A compact Convolutional Neural Network (CNN) with a specialized grouped-convolution architecture is designed to fuse these modalities efficiently on a microcontroller (MCU). Experimental results on a custom dataset of 7 static gestures, validated via k-fold cross-validation, demonstrate that the proposed fusion strategy significantly outperforms single-sensor baselines, reaching an accuracy of 92.3% and a macro F1-score of 0.93. Finally, on-device benchmarks on STM32F4 and STM32H7 MCUs confirm the system's suitability for resource-constrained wearables: the model requires only 6,343 parameters and achieves millisecond-level inference latency at a total system power of 50 mW.
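To illustrate why a grouped convolution suits a small parameter budget, the sketch below counts the learnable parameters of a 2D convolution with and without groups. With `groups=2`, each half of the output channels sees only half of the input channels, for example one group per sensor modality, which roughly halves the weight count. The helper `conv2d_params` and the layer sizes are hypothetical. They are not the paper's architecture and are not meant to reproduce its 6,343-parameter total.

```python
def conv2d_params(in_ch, out_ch, kernel, groups=1, bias=True):
    """Learnable parameter count of a 2D convolution with a square kernel."""
    if in_ch % groups or out_ch % groups:
        raise ValueError("in_ch and out_ch must both be divisible by groups")
    # Each output channel connects only to the in_ch // groups inputs of its own group
    weights = out_ch * (in_ch // groups) * kernel * kernel
    return weights + (out_ch if bias else 0)


# Hypothetical layer: 16 -> 16 channels, 3x3 kernel (sizes chosen for illustration)
dense = conv2d_params(16, 16, 3)               # 16 * 16 * 9 + 16 = 2320
grouped = conv2d_params(16, 16, 3, groups=2)   # 16 *  8 * 9 + 16 = 1168
print(dense, grouped)
```

The trade-off is that groups do not exchange information within the layer, so a fusion network built this way still needs a later stage that mixes the depth and thermal features.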