🤖 AI Summary
This study addresses the challenge of video-based behavioral recognition in neurodiverse populations, particularly individuals with Autism Spectrum Disorder (ASD). We propose the first ASD video classification framework designed around responses to multimodal sensory stimulation. To support this, we construct a novel dataset comprising 2,467 videos of children reacting to gustatory, olfactory, auditory, tactile, and visual stimuli; extract CNN-attention features from 1.4 million frames; and annotate each video with head pose angles and fine-grained, temporally aligned facial expression descriptions. Our method integrates per-frame response features, motion-correction parameters, and structured temporal semantic labels, overcoming limitations of conventional static-image and unimodal approaches. Experiments demonstrate that head pose estimation effectively suppresses motion-induced noise, while structured temporal annotations significantly improve classification accuracy. Results validate the critical importance of multimodal stimulation paradigms and fine-grained behavioral modeling for ASD video analysis.
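To illustrate how the provided head pose angles could be used to suppress motion-induced noise, the sketch below down-weights frames whose pose changes sharply between consecutive timesteps. The array names (`features`, `yaw`, `pitch`, `roll`) and the threshold are hypothetical choices for illustration, not the paper's actual procedure; the dataset only supplies the angles, and this is one plausible way to apply them.

```python
import numpy as np

def suppress_motion_noise(features, yaw, pitch, roll, max_delta_deg=15.0):
    """Down-weight per-frame features when the head pose changes sharply.

    features : (T, D) array of per-frame CNN/attention features.
    yaw, pitch, roll : (T,) arrays of head pose angles in degrees.
    max_delta_deg : hypothetical threshold on frame-to-frame pose change.
    """
    pose = np.stack([yaw, pitch, roll], axis=1)        # (T, 3) pose trajectory
    delta = np.abs(np.diff(pose, axis=0)).max(axis=1)  # largest angle change per step
    delta = np.concatenate([[0.0], delta])             # pad so the mask has length T
    # Frames with large pose jumps are treated as motion noise and masked out.
    weights = (delta < max_delta_deg).astype(features.dtype)
    return features * weights[:, None], weights
```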
📝 Abstract
Autism Spectrum Disorder (ASD) can affect individuals to varying degrees of intensity, ranging from challenges in overall health and communication to difficulties in sensory processing, and it often begins at a young age. It is therefore critical for medical professionals to be able to accurately diagnose ASD in young children, but doing so is difficult. Deep learning can be responsibly leveraged to improve productivity in addressing this task. The availability of data, however, remains a considerable obstacle. Hence, in this work, we introduce the Video ASD dataset, a dataset of video frame convolutional and attention map features, to foster further progress in the task of ASD classification. The original videos showcase children reacting to chemo-sensory (taste and smell), auditory, touch, and vision stimuli. The dataset contains the features of the frames spanning 2,467 videos, for a total of approximately 1.4 million frames. Additionally, head pose angles are included to account for head movement noise, as well as full-sentence text labels for the taste and smell videos that describe how the facial expression changes before, immediately after, and long after interaction with the stimuli. In addition to providing features, we also test foundation models on this data to showcase how movement noise affects performance and to demonstrate the need for more data and more complex labels.
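To make the dataset composition concrete, here is a minimal sketch of what a per-video record might look like when loaded, assuming frame features, head pose angles, and text labels are grouped per video. The field names and layout are hypothetical illustrations, not the released file format.

```python
from dataclasses import dataclass
from typing import List, Optional

import numpy as np

@dataclass
class VideoASDRecord:
    """Hypothetical per-video record; field names are illustrative only."""
    video_id: str
    stimulus: str                  # e.g. "taste", "smell", "auditory", "touch", "vision"
    label: int                     # ASD vs. non-ASD class label
    conv_features: np.ndarray      # (num_frames, D_conv) convolutional features per frame
    attn_features: np.ndarray      # (num_frames, D_attn) attention map features per frame
    head_pose: np.ndarray          # (num_frames, 3) yaw/pitch/roll angles
    expression_text: Optional[List[str]] = None  # taste/smell only: before, just after, long after
```

A loader built on such records would iterate over the 2,467 videos (roughly 1.4 million frames in total) and feed the per-frame features, optionally weighted by the head pose angles, to a downstream classifier.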