Efficient Sensor Fusion for Gesture Recognition on Resource-Constrained Devices

📅 2026-05-13

🤖 AI Summary
This work addresses the high power consumption, latency, and privacy concerns of vision-based gesture recognition on resource-constrained devices such as smart glasses by proposing a lightweight multimodal fusion approach for microcontrollers (MCUs). The method combines an 8×8 low-resolution time-of-flight sensor (VL53L8CH) with an 8×8 infrared thermal array (AMG8833) and introduces a grouped-convolution architecture that fuses the two modalities with only 6,343 parameters. Evaluated on a seven-class static gesture dataset, the system achieves 92.3% accuracy and a macro F1-score of 0.93. On-device deployment on STM32 MCUs demonstrates millisecond-level inference latency at a total system power of 50 mW, balancing efficiency, accuracy, and on-device privacy preservation.
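For readers unfamiliar with the macro F1-score reported above, the following is a minimal sketch of how the metric is commonly computed: the F1-score of each class is calculated separately and the results are averaged with equal weight, so rare gestures count as much as frequent ones. The function name `macro_f1` and the toy labels are illustrative assumptions and do not come from the paper or its dataset.

```python
def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores (illustrative helper, not from the paper)."""
    f1_scores = []
    for cls in range(num_classes):
        # Count true positives, false positives, and false negatives for this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        # A class that never appears and is never predicted contributes 0 here.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / num_classes


# Toy three-class example (made-up labels, not the seven-gesture dataset).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
score = macro_f1(y_true, y_pred, num_classes=3)
print(round(score, 4))
```

In the toy example the per-class F1 scores are 1.0, 2/3, and 0.8, so the macro average is about 0.8222.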
📝 Abstract
Gesture recognition is a cornerstone of Human-Computer Interaction (HCI) for smart eyewear, enabling natural and device-free control in augmented reality environments. Traditional vision-based approaches face significant challenges regarding power consumption, computational latency, and user privacy. This paper proposes a lightweight, privacy-preserving gesture recognition system based on the fusion of low-resolution Time-of-Flight (ToF) and Infrared (IR) thermal sensors. We use an 8×8 multizone ToF sensor (VL53L8CH) and an 8×8 IR array (AMG8833) to capture complementary depth and thermal cues. A compact Convolutional Neural Network (CNN) with a specialized grouped-convolution architecture is designed to fuse these modalities efficiently on a microcontroller (MCU). Experimental results on a custom dataset of 7 static gestures, validated via k-fold cross-validation, demonstrate that the proposed fusion strategy significantly outperforms single-sensor baselines, with an accuracy of 92.3% and a macro F1-score of 0.93. Finally, on-device benchmarks on STM32F4 and STM32H7 MCUs confirm the system's suitability for resource-constrained wearables: the model requires only 6,343 parameters and achieves millisecond-level inference latency with a total system power of 50 mW.
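To illustrate why grouped convolutions help keep the parameter count low, here is a minimal sketch that counts the weights of a single 2D convolution layer with and without grouping. The layer sizes (two input channels for the ToF and IR frames, 16 output channels, a 3×3 kernel) are hypothetical values chosen for illustration; the abstract does not specify the paper's actual layer configuration, and this does not reproduce the 6,343-parameter model.

```python
def conv2d_params(c_in, c_out, kernel, groups=1, bias=True):
    """Parameter count of a 2D convolution with a square kernel.

    With `groups` > 1, each output channel only sees c_in / groups input
    channels, so the weight count shrinks by a factor of `groups`.
    """
    if c_in % groups != 0 or c_out % groups != 0:
        raise ValueError("c_in and c_out must both be divisible by groups")
    weights = (c_in // groups) * kernel * kernel * c_out
    return weights + (c_out if bias else 0)


# Hypothetical first layer: one 8x8 depth channel plus one 8x8 thermal channel.
standard = conv2d_params(c_in=2, c_out=16, kernel=3, groups=1)
grouped = conv2d_params(c_in=2, c_out=16, kernel=3, groups=2)
print(standard, grouped)
```

With `groups=2`, each modality is filtered by its own half of the output channels, so in this assumed layer the count drops from 304 to 160 parameters. A later layer would then have to mix the two groups for the modalities to be fused.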
Problem

Research questions and friction points this paper is trying to address.

gesture recognition
resource-constrained devices
sensor fusion
human-computer interaction
privacy-preserving
Innovation

Methods, ideas, or system contributions that make the work stand out.

sensor fusion
lightweight CNN
ToF and IR sensors
on-device inference
gesture recognition