Efficient Sensor Fusion for Gesture Recognition on Resource-Constrained Devices

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high power consumption, latency, and privacy concerns of vision-based gesture recognition on resource-constrained devices such as smart glasses by proposing a lightweight multimodal fusion approach tailored to microcontrollers (MCUs). The method combines an 8×8 low-resolution time-of-flight sensor (VL53L8CH) with an 8×8 infrared thermal array (AMG8833) and introduces a grouped-convolution architecture that fuses the two data streams with only 6,343 parameters. Evaluated on a seven-class static gesture dataset, the system achieves 92.3% accuracy and a macro F1-score of 0.93. Deployment on STM32F4 and STM32H7 MCUs demonstrates millisecond-level inference latency and a total system power of 50 mW, balancing efficiency, accuracy, and on-device privacy preservation.
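To illustrate why a grouped convolution helps keep the parameter count small, the sketch below counts the learnable parameters of a single 2D convolution layer with and without grouping. The layer sizes (2 input channels for the ToF and IR frames, 16 output channels, a 3×3 kernel) and the helper name `conv2d_params` are assumptions chosen for illustration; they are not the paper's architecture and do not reproduce its 6,343-parameter total.

```python
def conv2d_params(c_in, c_out, kernel, groups=1, bias=True):
    """Count learnable parameters of a 2D convolution layer.

    With `groups` > 1, each output filter only sees c_in / groups input
    channels, so the weight count shrinks by a factor of `groups`.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("c_in and c_out must both be divisible by groups")
    weights = (c_in // groups) * c_out * kernel * kernel
    return weights + (c_out if bias else 0)


# Hypothetical first layer: channel 0 = 8x8 ToF depth, channel 1 = 8x8 IR thermal.
dense = conv2d_params(c_in=2, c_out=16, kernel=3, groups=1)    # every filter sees both sensors
grouped = conv2d_params(c_in=2, c_out=16, kernel=3, groups=2)  # 8 filters per sensor

print("standard conv:", dense)   # 2 * 16 * 9 + 16 = 304
print("grouped conv: ", grouped)  # 1 * 16 * 9 + 16 = 160
```

With `groups=2`, each modality is processed by its own set of filters before any later layer mixes them, which is one common way a grouped convolution can act as a cheap per-sensor feature extractor.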

📝 Abstract

Gesture recognition is a cornerstone of Human-Computer Interaction (HCI) for smart eyewear, enabling natural and device-free control in augmented reality environments. Traditional vision-based approaches face significant challenges regarding power consumption, computational latency, and user privacy. This paper proposes a lightweight, privacy-preserving gesture recognition system based on the fusion of low-resolution Time-of-Flight (ToF) and Infrared (IR) thermal sensors. We use an 8×8 multizone ToF sensor (VL53L8CH) and an 8×8 IR array (AMG8833) to capture complementary depth and thermal cues. A compact Convolutional Neural Network (CNN) with a specialized grouped-convolution architecture is designed to fuse these modalities efficiently on a microcontroller (MCU). Experimental results on a custom dataset of 7 static gestures, validated via k-fold cross-validation, demonstrate that the proposed fusion strategy significantly outperforms single-sensor baselines, reaching an accuracy of 92.3% and a macro F1-score of 0.93. Finally, on-device benchmarks on STM32F4 and STM32H7 MCUs confirm the system's suitability for resource-constrained wearables: the model requires only 6,343 parameters and achieves millisecond-level inference latency at a total system power of 50 mW.
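For readers unfamiliar with the two reported metrics, the following minimal sketch computes accuracy and macro F1 from a confusion matrix using the standard definitions. The 3-class matrix `toy` is invented purely for illustration (the paper's dataset has 7 classes, and its confusion matrix is not given here), and the function names are hypothetical.

```python
def accuracy(confusion):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def macro_f1(confusion):
    """Unweighted mean of per-class F1 scores.

    confusion[i][j] = number of samples of true class i predicted as class j.
    """
    n = len(confusion)
    scores = []
    for c in range(n):
        tp = confusion[c][c]
        fp = sum(confusion[r][c] for r in range(n)) - tp  # predicted c, true class differs
        fn = sum(confusion[c]) - tp                       # true c, predicted otherwise
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n


# Made-up 3-class example, not data from the paper.
toy = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 10],
]
print(f"accuracy = {accuracy(toy):.3f}")
print(f"macro F1 = {macro_f1(toy):.3f}")
```

Because macro F1 averages per-class scores without weighting by class size, it can differ slightly from accuracy even on a roughly balanced dataset.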

Problem

Research questions and friction points this paper is trying to address.

gesture recognition

resource-constrained devices

sensor fusion

human-computer interaction

privacy-preserving

Innovation

Methods, ideas, or system contributions that make the work stand out.

sensor fusion

lightweight CNN

ToF and IR sensors