Advanced Gesture Recognition in Autism: Integrating YOLOv7, Video Augmentation and VideoMAE for Video Analysis

📅 2024-10-12
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the fine-grained recognition of repetitive motor stereotypies (e.g., spinning, head-banging, arm-flapping) exhibited by children with autism spectrum disorder (ASD) in naturalistic settings. We propose a novel self-supervised learning framework based on VideoMAE—the first application of VideoMAE to this task—leveraging spatiotemporal masked video modeling to enhance dynamic motion representation. The framework integrates YOLOv7-based person detection with temporal video augmentation to enable non-intrusive, real-time behavioral assessment. Evaluated on the SSBD dataset, our method achieves 97.7% classification accuracy, outperforming prior state-of-the-art approaches by 14.7 percentage points—the highest reported performance to date. This work demonstrates VideoMAE’s effectiveness for fine-grained, low-data behavioral recognition and establishes a practical, deployable technical pathway for unobtrusive early screening of ASD.
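The "spatiotemporal masked video modeling" the summary refers to can be illustrated with VideoMAE's tube-masking scheme, in which the same spatial patches are hidden in every frame so the model cannot recover them from neighboring timesteps. A minimal NumPy sketch (the function name is ours; 16 frames, 196 patches, and a 0.9 mask ratio follow common VideoMAE defaults and are assumptions about this paper's exact settings):

```python
import numpy as np

def tube_mask(num_frames, num_patches, mask_ratio=0.9, rng=None):
    """Tube masking in the style of VideoMAE: the same spatial patches
    are masked in every frame, forcing temporal reasoning.

    Returns a boolean array of shape (num_frames, num_patches),
    where True marks a masked patch.
    """
    rng = rng or np.random.default_rng()
    num_masked = int(round(num_patches * mask_ratio))
    # Pick spatial positions once, then repeat that pattern over all frames.
    positions = rng.choice(num_patches, size=num_masked, replace=False)
    frame_mask = np.zeros(num_patches, dtype=bool)
    frame_mask[positions] = True
    return np.tile(frame_mask, (num_frames, 1))

mask = tube_mask(num_frames=16, num_patches=196, mask_ratio=0.9)
print(mask.shape)  # (16, 196)
```

Because every frame shares one mask, the pre-training reconstruction target emphasizes motion rather than static appearance, which is what makes the approach attractive for repetitive-gesture recognition.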

📝 Abstract
Deep learning and advancements in contactless sensors have significantly enhanced our ability to understand complex human activities in healthcare settings. In particular, deep learning models utilizing computer vision have been developed to enable detailed analysis of human gesture recognition, especially repetitive gestures, which are commonly observed behaviors in children with autism. This research work aims to identify repetitive behaviors indicative of autism by analyzing videos captured in natural settings as children engage in daily activities. The focus is on accurately categorizing real-time repetitive gestures such as spinning, head banging, and arm flapping. To this end, we utilize the publicly accessible Self-Stimulatory Behavior Dataset (SSBD) to classify these stereotypical movements. A key component of the proposed methodology is the use of VideoMAE, a model designed to improve both spatial and temporal analysis of video data through a masking and reconstruction mechanism. This model significantly outperformed traditional methods, achieving an accuracy of 97.7%, a 14.7 percentage-point improvement over the previous state-of-the-art.
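The temporal video augmentation used to diversify the small SSBD training set can be approximated with simple clip-level transforms such as random temporal cropping and horizontal flipping. A hedged sketch (function names are ours; the paper's exact augmentation recipe may differ):

```python
import numpy as np

def temporal_crop(video, clip_len, rng=None):
    """Sample a random contiguous window of clip_len frames from a
    (T, H, W, C) clip, loop-padding videos that are too short."""
    rng = rng or np.random.default_rng()
    t = video.shape[0]
    if t < clip_len:
        reps = -(-clip_len // t)  # ceil division
        video = np.tile(video, (reps, 1, 1, 1))
        t = video.shape[0]
    start = int(rng.integers(0, t - clip_len + 1))
    return video[start:start + clip_len]

def horizontal_flip(video, p=0.5, rng=None):
    """Flip the width axis of a (T, H, W, C) clip with probability p."""
    rng = rng or np.random.default_rng()
    return video[:, :, ::-1, :] if rng.random() < p else video
```

Sampling different temporal windows from each training video is a cheap way to multiply the effective number of clips, which matters for a low-data benchmark like SSBD.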
Problem

Research questions and friction points this paper is trying to address.

Identify repetitive autism behaviors via video analysis
Classify real-time gestures like spinning and head banging
Enhance video analysis accuracy using VideoMAE model
Innovation

Methods, ideas, or system contributions that make the work stand out.

YOLOv7 for real-time gesture detection
Video Augmentation to enhance dataset diversity
VideoMAE for spatiotemporal video analysis
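Taken together, the three contributions form a detect-augment-classify pipeline: YOLOv7 localizes the child, the clip is cropped to that region, and the augmented crop is fed to VideoMAE. A minimal sketch of the cropping step (the pixel-coordinate bbox format and helper name are assumptions; YOLOv7 inference itself is omitted):

```python
import numpy as np

def crop_person(frames, bbox):
    """Crop every frame of a (T, H, W, C) clip to a detected person box
    given as pixel coordinates (x1, y1, x2, y2), clamped to bounds."""
    x1, y1, x2, y2 = bbox
    _, h, w, _ = frames.shape
    x1, x2 = max(0, x1), min(w, x2)
    y1, y2 = max(0, y1), min(h, y2)
    return frames[:, y1:y2, x1:x2, :]

# Example: a 16-frame clip with a hypothetical detection box.
clip = np.zeros((16, 240, 320, 3), dtype=np.uint8)
person = crop_person(clip, (60, 20, 260, 220))
print(person.shape)  # (16, 200, 200, 3)
```

Cropping to the detected person before classification removes background clutter, so the classifier's capacity is spent on the gesture rather than the scene.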
Amit Kumar Singh
Department of Information Technology, Indian Institute of Information Technology, Allahabad, 211013, India
Trapti Shrivastava
Department of Information Technology, Indian Institute of Information Technology, Allahabad, 211013, India
Vrijendra Singh
Professor, IT Dept.@Indian Institute of Information Technology Allahabad
ML & GenAI · Data Analytics · Social Network Analysis · Information Security · Time Series Analytics