🤖 AI Summary
This work addresses the fine-grained recognition of repetitive motor stereotypies (e.g., spinning, head-banging, arm-flapping) exhibited by children with autism spectrum disorder (ASD) in naturalistic settings. We propose a novel self-supervised learning framework based on VideoMAE—the first application of VideoMAE to this task—leveraging spatiotemporal masked video modeling to enhance dynamic motion representation. The framework integrates YOLOv7-based person detection with temporal video augmentation to enable non-intrusive, real-time behavioral assessment. Evaluated on the SSBD dataset, our method achieves 97.7% classification accuracy, outperforming prior state-of-the-art approaches by 14.7 percentage points—the highest reported performance to date. This work demonstrates VideoMAE’s effectiveness for fine-grained, low-data behavioral recognition and establishes a practical, deployable technical pathway for unobtrusive early screening of ASD.
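The temporal video augmentation mentioned above typically means sampling each training clip as a fixed-length window with a randomized start and frame stride. The paper does not give its exact sampling scheme, so the following is a minimal sketch of one common approach (function name, stride policy, and parameters are illustrative assumptions, not the authors' implementation):

```python
# Hypothetical temporal-augmentation sketch: sample a fixed-length clip
# from a longer video with a random start offset and a random frame
# stride (temporal jitter). Names and policy are assumptions.
import numpy as np

def sample_clip(total_frames: int, clip_len: int, rng: np.random.Generator) -> np.ndarray:
    """Return `clip_len` evenly strided frame indices into a video."""
    # Largest stride that still fits a full clip inside the video.
    max_stride = max(1, total_frames // clip_len)
    stride = int(rng.integers(1, max_stride + 1))
    span = stride * (clip_len - 1) + 1          # frames covered by the clip
    start = int(rng.integers(0, max(1, total_frames - span + 1)))
    return start + stride * np.arange(clip_len)
```

At training time this would be applied per epoch, so each video contributes differently positioned and differently paced clips; the detected person crop (e.g. from a YOLOv7-style detector) would then be extracted from the sampled frames.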
📝 Abstract
Deep learning and advances in contactless sensing have significantly enhanced our ability to understand complex human activities in healthcare settings. In particular, deep learning models built on computer vision enable detailed analysis of human gestures, especially the repetitive gestures commonly observed in children with autism. This work aims to identify repetitive behaviors indicative of autism by analyzing videos captured in natural settings as children engage in daily activities, focusing on accurately categorizing real-time repetitive gestures such as spinning, head banging, and arm flapping. To this end, we use the publicly available Self-Stimulatory Behavior Dataset (SSBD) to classify these stereotypical movements. A key component of the proposed methodology is **VideoMAE**, a model designed to improve both the spatial and temporal analysis of video data through a masking-and-reconstruction mechanism. This model significantly outperformed traditional methods, achieving an accuracy of 97.7%, a 14.7 percentage-point improvement over the previous state of the art.
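VideoMAE's masking-and-reconstruction mechanism hides a large fraction of spatiotemporal patches and trains the model to reconstruct them, which forces the encoder to learn motion dynamics rather than per-frame appearance. A distinctive detail is "tube" masking: the same spatial patches are hidden in every frame, so the model cannot cheat by copying a visible patch from a neighboring frame. The sketch below illustrates only the mask construction (the 90% ratio is VideoMAE's commonly used default; the function itself is an illustrative assumption, not the authors' code):

```python
# Hypothetical sketch of VideoMAE-style "tube" masking for a clip
# tokenized into num_frames temporal slots x num_patches spatial patches.
# The same spatial mask is repeated across frames, forming tubes; the
# encoder sees only the unmasked tokens and a decoder reconstructs the rest.
import numpy as np

def tube_mask(num_frames: int, num_patches: int, mask_ratio: float,
              seed: int = 0) -> np.ndarray:
    """Return a boolean (num_frames, num_patches) array; True = masked."""
    rng = np.random.default_rng(seed)
    num_masked = int(num_patches * mask_ratio)
    # Pick spatial patches to hide, identically in every frame ("tube").
    masked_idx = rng.choice(num_patches, size=num_masked, replace=False)
    spatial_mask = np.zeros(num_patches, dtype=bool)
    spatial_mask[masked_idx] = True
    return np.tile(spatial_mask, (num_frames, 1))

# e.g. 8 temporal slots of a 224x224 clip with 16x16 patches -> 196 patches
mask = tube_mask(num_frames=8, num_patches=196, mask_ratio=0.9)
visible_per_frame = (~mask).sum(axis=1)  # tokens the encoder actually sees
```

With a 0.9 ratio the encoder processes only about 10% of the tokens per frame, which is what makes masked video pretraining both effective on small datasets like SSBD and comparatively cheap to run.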