🤖 AI Summary
To address the limited accessibility of clinical Autism Spectrum Disorder (ASD) diagnosis in resource-constrained regions, this paper proposes a lightweight multimodal automated diagnostic method based on eye-tracking trajectories. The method introduces a novel ViT-Mamba hybrid architecture and, for the first time, incorporates a cross-architecture attention fusion mechanism to jointly model visual, vocal, and facial dynamic cues, capturing both spatial representations and long-range temporal dependencies. To enhance clinical interpretability and trustworthiness, Grad-CAM-based explainable AI is integrated, and the end-to-end design eliminates reliance on handcrafted feature engineering. Evaluated on the Saliency4ASD dataset, the approach achieves 96% accuracy, 95% F1-score, 97% sensitivity, and 94% specificity, substantially outperforming state-of-the-art methods. Its lightweight design enables efficient remote screening, offering a scalable solution for early ASD detection in underserved settings.
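For a concrete picture of the kind of hybrid the summary describes, below is a minimal PyTorch sketch: a ViT-style spatial encoder, a temporal branch for gaze trajectories, and a cross-architecture attention fusion step. All module names, dimensions, and the two-stream layout are illustrative assumptions; a GRU stands in for the actual Mamba selective-SSM block (e.g., from the `mamba_ssm` package), and only the visual and gaze streams are shown, whereas the paper additionally fuses speech and facial cues.

```python
# Illustrative sketch of a ViT-Mamba-style hybrid with cross-attention fusion.
# Module names and hyperparameters are assumptions, not the paper's architecture.
import torch
import torch.nn as nn

class PatchViT(nn.Module):
    """Tiny ViT-style spatial encoder over stimulus/saliency images."""
    def __init__(self, img_size=224, patch=16, dim=128, depth=2, heads=4):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        n_tokens = (img_size // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, x):                       # x: (B, 3, H, W)
        tokens = self.embed(x).flatten(2).transpose(1, 2) + self.pos
        return self.encoder(tokens)             # (B, N, dim)

class TemporalSSM(nn.Module):
    """Stand-in for a Mamba block: captures long-range temporal dependencies."""
    def __init__(self, in_dim, dim=128):
        super().__init__()
        # A GRU substitutes for the selective state-space layer in this sketch.
        self.rnn = nn.GRU(in_dim, dim, batch_first=True)

    def forward(self, x):                       # x: (B, T, in_dim)
        out, _ = self.rnn(x)
        return out                              # (B, T, dim)

class CrossAttnFusion(nn.Module):
    """Cross-architecture attention: temporal tokens attend to spatial tokens."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, q_seq, kv_seq):
        fused, _ = self.attn(q_seq, kv_seq, kv_seq)
        return self.norm(q_seq + fused)         # residual + norm

class ViTMambaASD(nn.Module):
    def __init__(self, gaze_dim=2, dim=128, n_classes=2):
        super().__init__()
        self.spatial = PatchViT(dim=dim)
        self.temporal = TemporalSSM(gaze_dim, dim)
        self.fusion = CrossAttnFusion(dim)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, image, gaze_seq):
        s = self.spatial(image)                 # spatial tokens from ViT branch
        t = self.temporal(gaze_seq)             # temporal tokens from SSM branch
        fused = self.fusion(t, s)               # temporal queries, spatial keys
        return self.head(fused.mean(dim=1))     # pooled classification logits

model = ViTMambaASD()
logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 50, 2))
```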
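The Grad-CAM integration can be sketched in the same spirit. Which layer the paper actually visualizes is not stated here, so this example hooks the patch-embedding convolution of the hypothetical model above; the weighting scheme is standard Grad-CAM (gradient-averaged channel weights over the spatial activation map).

```python
# Illustrative Grad-CAM over the patch-embedding conv of the sketch model.
# The choice of target layer is an assumption for demonstration purposes.
import torch

def grad_cam(model, image, gaze_seq, target_class):
    feats, grads = {}, {}
    layer = model.spatial.embed                  # conv patch embedding
    h1 = layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))
    logits = model(image, gaze_seq)
    logits[0, target_class].backward()           # gradients w.r.t. target logit
    h1.remove()
    h2.remove()
    a, g = feats["a"], grads["a"]                # both (B, C, h, w)
    weights = g.mean(dim=(2, 3), keepdim=True)   # per-channel importance
    cam = torch.relu((weights * a).sum(dim=1))   # (B, h, w) heatmap
    return cam / (cam.max() + 1e-8)              # normalize to [0, 1]
```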
📝 Abstract
Accurate Autism Spectrum Disorder (ASD) diagnosis is vital for early intervention. This study presents a hybrid deep learning framework combining Vision Transformers (ViT) and Vision Mamba to detect ASD using eye-tracking data. The model uses attention-based fusion to integrate visual, speech, and facial cues, capturing both spatial and temporal dynamics. Unlike traditional approaches that rely on handcrafted features, it applies state-of-the-art deep learning and explainable AI techniques to enhance diagnostic accuracy and transparency. Tested on the Saliency4ASD dataset, the proposed ViT-Mamba model outperformed existing methods, achieving 0.96 accuracy, 0.95 F1-score, 0.97 sensitivity, and 0.94 specificity. These findings show the model's promise for scalable, interpretable ASD screening, especially in resource-constrained or remote clinical settings where access to expert diagnosis is limited.
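As a quick reference for how these four reported numbers relate to one another, the sketch below derives them from a binary confusion matrix. The counts are a hypothetical 100-subject test split chosen for illustration, not the paper's actual results.

```python
# Hedged sketch: the four screening metrics as functions of confusion-matrix
# counts. tp/fp/tn/fn values below are hypothetical, not from the paper.

def screening_metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)          # recall on ASD-positive subjects
    specificity = tn / (tn + fp)          # recall on typically-developing subjects
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, f1, sensitivity, specificity

# Hypothetical 100-subject split: 50 ASD, 50 typically developing.
print(screening_metrics(tp=48, fp=3, tn=47, fn=2))
# -> accuracy 0.95, F1 ~0.95, sensitivity 0.96, specificity 0.94
```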