🤖 AI Summary
Early identification of Autism Spectrum Disorder (ASD) in children is hampered by a lack of multilingual speech analytics, particularly for code-switching populations. Method: We introduce CoSAm—the first English–Hindi code-switching child speech dataset—and propose a modality-order-sensitive hierarchical fusion framework. It is the first to empirically demonstrate that the order in which acoustic, linguistic, and paralinguistic modalities are fused critically impacts ASD classification performance. Our two-stage architecture first fuses MFCCs, statistical acoustic features, and linguistic representations; paralinguistic features are then integrated in a second stage. The model employs a Transformer encoder for robustness and interpretability. Results: Evaluated on 61 recordings from children with ASD and 31 from typically developing children, our approach achieves 98.75% classification accuracy—significantly outperforming parallel-fusion and unimodal baselines. This work establishes a novel paradigm and empirical foundation for multilingual, speech-based ASD screening.
📝 Abstract
Autism Spectrum Disorder (ASD) is a complex neurodevelopmental condition characterized by difficulties in social interaction and communication and by repetitive behaviors across a range of situations. Its increasing prevalence underscores its importance as a major public health concern and the need for comprehensive research into the disorder and methods for its early detection. This study introduces a novel hierarchical feature fusion method aimed at enhancing the early detection of ASD in children through the analysis of code-switched (English–Hindi) speech. Employing advanced audio processing techniques, the research integrates acoustic, paralinguistic, and linguistic information using Transformer encoders. This fusion strategy is designed to improve classification robustness and accuracy, both crucial for early and precise ASD identification. The methodology involves collecting CoSAm, a code-switched speech corpus, from children diagnosed with ASD and a matched control group. The dataset comprises 61 voice recordings from 30 children diagnosed with ASD and 31 recordings from neurotypical children, aged between 3 and 13 years, for a total of 159.75 minutes of speech. The feature analysis focuses on MFCCs and extensive statistical attributes to capture the variability and complexity of speech patterns. The best performance, 98.75% accuracy, is achieved with a hierarchical fusion technique that combines acoustic and linguistic features first and then integrates paralinguistic features.
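The two-stage fusion order described above can be sketched as follows. This is a minimal illustration of the idea only: the feature dimensions are hypothetical, random vectors stand in for real extracted features, and simple tanh-activated linear projections replace the paper's Transformer encoders. The key point shown is the ordering—acoustic (MFCC + statistical) and linguistic features are fused first, and paralinguistic features are integrated afterwards.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-utterance feature dimensions (illustrative, not from the paper)
N_MFCC, N_STAT, N_LING, N_PARA = 13, 20, 32, 8
D_FUSED = 16  # shared fused-representation size

def project(x, w, b):
    """Stand-in for an encoder block: linear projection + tanh nonlinearity."""
    return np.tanh(x @ w + b)

# Placeholder feature vectors (in practice: MFCCs, acoustic statistics,
# linguistic embeddings, and paralinguistic descriptors per utterance)
mfcc = rng.normal(size=N_MFCC)
stats = rng.normal(size=N_STAT)
ling = rng.normal(size=N_LING)
para = rng.normal(size=N_PARA)

# Stage 1: fuse acoustic (MFCC + statistical) and linguistic features
stage1_in = np.concatenate([mfcc, stats, ling])
w1 = rng.normal(size=(stage1_in.size, D_FUSED))
stage1_out = project(stage1_in, w1, np.zeros(D_FUSED))

# Stage 2: integrate paralinguistic features with the stage-1 representation
stage2_in = np.concatenate([stage1_out, para])
w2 = rng.normal(size=(stage2_in.size, D_FUSED))
stage2_out = project(stage2_in, w2, np.zeros(D_FUSED))

# Binary ASD vs. typically-developing decision head (logistic output)
w_out = rng.normal(size=(D_FUSED, 1))
logit = float(stage2_out @ w_out)
prob_asd = 1.0 / (1.0 + np.exp(-logit))
print(stage2_out.shape, 0.0 < prob_asd < 1.0)
```

Reordering the stages (e.g., fusing paralinguistic features first) changes which interactions the later stage can model, which is the modality-order sensitivity the paper investigates.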