Skeleton-Based Intake Gesture Detection With Spatial-Temporal Graph Convolutional Networks

📅 2025-04-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address challenges in daily dietary monitoring, including difficult data acquisition, privacy sensitivity, and environmental interference, this paper proposes an automatic eating- and drinking-gesture recognition method based on human skeletal sequences. The approach integrates a dilated spatio-temporal graph convolutional network (ST-GCN) with a bidirectional long short-term memory (BiLSTM) network to jointly model the spatial skeletal topology and capture temporal dynamics. To the authors' knowledge, this is the first work to combine a dilated ST-GCN with a BiLSTM for intake gesture detection, and the combination improves cross-device and cross-environment generalization, enabling joint modeling across controlled laboratory settings and real-world smartphone deployments. On the OREBA dataset, the method achieves segmental F1-scores of 86.18% for eating and 74.84% for drinking; when transferred to a custom smartphone-collected dataset, it maintains strong performance at 85.40% and 67.80%, respectively, demonstrating robustness and practical applicability.
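To make the architecture concrete, below is a minimal PyTorch sketch of a dilated ST-GCN block feeding a BiLSTM head. This is not the authors' implementation: the class names, layer sizes, joint count (13), 2-D input coordinates, dilation factor, and the placeholder identity adjacency matrix are all illustrative assumptions; a real model would use the skeleton's actual edge structure and stack several blocks.

```python
# Minimal sketch (not the paper's code): dilated ST-GCN block + BiLSTM head.
# Input tensors are assumed to have shape (batch, channels, frames, joints).
import torch
import torch.nn as nn


class STGCNBlock(nn.Module):
    """One spatial graph convolution followed by a dilated temporal conv."""

    def __init__(self, in_ch, out_ch, adjacency, dilation=2):
        super().__init__()
        # Normalized adjacency A_hat = D^-1/2 (A + I) D^-1/2, fixed here.
        a_hat = adjacency + torch.eye(adjacency.size(0))
        deg = a_hat.sum(dim=1).pow(-0.5)
        self.register_buffer("a_hat", deg[:, None] * a_hat * deg[None, :])
        self.spatial = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        pad = dilation * (9 - 1) // 2  # keep the frame count unchanged
        self.temporal = nn.Conv2d(out_ch, out_ch, kernel_size=(9, 1),
                                  padding=(pad, 0), dilation=(dilation, 1))
        self.relu = nn.ReLU()

    def forward(self, x):  # x: (N, C, T, V)
        # Mix joint features along graph edges, then convolve over time.
        x = torch.einsum("nctv,vw->nctw", self.spatial(x), self.a_hat)
        return self.relu(self.temporal(x))


class STGCNBiLSTM(nn.Module):
    def __init__(self, num_joints=13, num_classes=3):
        super().__init__()
        adjacency = torch.eye(num_joints)  # placeholder skeleton graph
        self.gcn = STGCNBlock(2, 64, adjacency, dilation=2)
        self.lstm = nn.LSTM(64 * num_joints, 128,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * 128, num_classes)  # e.g. eat/drink/none

    def forward(self, x):  # x: (N, 2, T, V) 2-D joint coordinates
        x = self.gcn(x)  # (N, 64, T, V)
        n, c, t, v = x.shape
        x, _ = self.lstm(x.permute(0, 2, 1, 3).reshape(n, t, c * v))
        return self.head(x)  # frame-wise class logits, (N, T, num_classes)


if __name__ == "__main__":
    logits = STGCNBiLSTM()(torch.randn(2, 2, 120, 13))
    print(logits.shape)  # torch.Size([2, 120, 3])
```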

📝 Abstract
Overweight and obesity have emerged as widespread societal challenges, frequently linked to unhealthy eating patterns. A promising approach to enhance dietary monitoring in everyday life involves automated detection of food intake gestures. This study introduces a skeleton-based approach using a model that combines a dilated spatial-temporal graph convolutional network (ST-GCN) with a bidirectional long short-term memory (BiLSTM) framework, termed ST-GCN-BiLSTM, to detect intake gestures. The skeleton-based method provides key benefits, including environmental robustness, reduced data dependency, and enhanced privacy preservation. Two datasets were employed for model validation. On the OREBA dataset, which consists of laboratory-recorded videos, the model achieved segmental F1-scores of 86.18% and 74.84% for identifying eating and drinking gestures. Additionally, a self-collected dataset of smartphone recordings made under more flexible experimental conditions was evaluated with the model trained on OREBA, yielding F1-scores of 85.40% and 67.80% for detecting eating and drinking gestures. The results not only confirm the feasibility of utilizing skeleton data for intake gesture detection but also highlight the robustness of the proposed approach in cross-dataset validation.
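The abstract reports segmental F1-scores, which score whole predicted gesture segments against ground-truth segments rather than individual frames. The sketch below shows one common way to compute such a score, assuming the usual IoU-based one-to-one matching; the paper's exact scoring rule and threshold may differ.

```python
# Hedged sketch of a segment-level F1 score: a predicted segment counts as a
# true positive when its IoU with a still-unmatched ground-truth segment
# reaches the threshold. This is the common definition, not necessarily the
# paper's exact variant.
def segmental_f1(pred, truth, iou_threshold=0.5):
    """pred, truth: lists of (start, end) frame intervals for one class."""
    matched = set()
    tp = 0
    for ps, pe in pred:
        for i, (ts, te) in enumerate(truth):
            if i in matched:
                continue
            inter = max(0, min(pe, te) - max(ps, ts))
            union = max(pe, te) - min(ps, ts)
            if union > 0 and inter / union >= iou_threshold:
                matched.add(i)
                tp += 1
                break
    fp = len(pred) - tp  # predictions with no matching ground truth
    fn = len(truth) - tp  # ground-truth segments that were missed
    return 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0


# Example: two of three predicted segments align well with the ground truth.
print(segmental_f1([(0, 30), (50, 80), (90, 95)], [(0, 28), (52, 85)]))  # 0.8
```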
Problem

Research questions and friction points this paper is trying to address.

Detect food intake gestures using skeleton data
Improve dietary monitoring with automated gesture recognition
Validate model robustness across different datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatial-temporal graph convolutional network for gesture detection
BiLSTM framework enhances temporal pattern recognition
Skeleton-based method ensures privacy and robustness (see the extraction sketch after this list)
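The privacy and robustness claims rest on the front end working only on joint coordinates, not raw pixels. The sketch below illustrates such a skeleton-extraction step using MediaPipe Pose as an assumed pose estimator (the paper does not specify its estimator here, and the video file name is hypothetical).

```python
# Illustrative only: extract per-frame pose keypoints from a video, assuming
# MediaPipe Pose as the estimator. The resulting skeleton stream discards
# appearance, which is what gives skeleton-based methods their privacy and
# environmental robustness.
import cv2
import mediapipe as mp

pose = mp.solutions.pose.Pose(static_image_mode=False)
cap = cv2.VideoCapture("meal_recording.mp4")  # hypothetical file name

frames = []  # per-frame list of (x, y) joint coordinates
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if result.pose_landmarks:
        frames.append([(lm.x, lm.y) for lm in result.pose_landmarks.landmark])
cap.release()
# `frames` can then be reshaped into the (N, 2, T, V) tensor sketched above.
```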
Chunzhuo Wang
e-Media Research Lab and ESAT-STADIUS Division, KU Leuven, 3000 Leuven, Belgium
Zhewen Xue
e-Media Research Lab and ESAT-STADIUS Division, KU Leuven, 3000 Leuven, Belgium
T. Sunil Kumar
University of Gävle, 801 76 Gävle, Sweden
Guido Camps
Division of Human Nutrition and Health, Department of Agrotechnology and Food Sciences, Wageningen University and Research, 6700 EA Wageningen, and OnePlanet Research Center, 6708 WE Wageningen, The Netherlands
Hans Hallez
KU Leuven, Campus Brugge
Biomedical engineering, signal and image processing, networked embedded systems, Internet of Things, mechatronics
Bart Vanrumste
KU Leuven, ESAT, STADIUS@GroepT
Telehealth, biomedical engineering, signal processing, image processing, machine learning