Hybrid Vision Transformer-Mamba Framework for Autism Diagnosis via Eye-Tracking Analysis

📅 2025-06-07
🤖 AI Summary
To address the limited accessibility of ASD clinical diagnosis in resource-constrained regions, this paper proposes a lightweight multimodal automated diagnostic method based on eye-tracking trajectories. The method introduces a novel ViT-Mamba hybrid architecture and, for the first time, incorporates a cross-architecture attention fusion mechanism to jointly model visual, speech, and facial dynamic cues, capturing both spatial representations and long-range temporal dependencies. Because features are learned end to end, the method does not rely on handcrafted feature engineering; Grad-CAM-based explainable AI is integrated to improve clinical interpretability and trustworthiness. Evaluated on the Saliency4ASD dataset, the approach achieves 96% accuracy, 95% F1-score, 97% sensitivity, and 94% specificity, outperforming the existing methods it is compared against. Its lightweight design enables efficient remote screening, offering a scalable solution for early ASD detection in underserved settings.
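For readers unfamiliar with Grad-CAM, the core computation can be sketched as follows. This is a minimal illustration of the standard Grad-CAM combination step, not the paper's implementation: the function name `grad_cam` and the nested-list inputs are assumptions, and a real pipeline would take the activations and gradients from the last convolutional or token-grid layer of the trained model.

```python
def grad_cam(activations, gradients):
    """Combine feature maps into a Grad-CAM heatmap.

    activations, gradients: nested lists shaped [K][H][W], where
    gradients[k] holds d(class score)/d(activations[k]).
    (Hypothetical inputs; obtaining them from a model is out of scope here.)
    """
    num_maps = len(activations)
    height = len(activations[0])
    width = len(activations[0][0])

    # Channel weights: global average of the gradients over each map.
    weights = [
        sum(sum(row) for row in grad_map) / (height * width)
        for grad_map in gradients
    ]

    # Weighted sum of the maps followed by ReLU, keeping only positive evidence.
    cam = [
        [
            max(0.0, sum(weights[k] * activations[k][i][j] for k in range(num_maps)))
            for j in range(width)
        ]
        for i in range(height)
    ]

    # Normalise to [0, 1] for display; skip if the map is all zeros.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam
```

The resulting heatmap is typically upsampled and overlaid on the input image so that a clinician can see which regions drove the prediction.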

📝 Abstract
Accurate Autism Spectrum Disorder (ASD) diagnosis is vital for early intervention. This study presents a hybrid deep learning framework combining Vision Transformers (ViT) and Vision Mamba to detect ASD using eye-tracking data. The model uses attention-based fusion to integrate visual, speech, and facial cues, capturing both spatial and temporal dynamics. Unlike traditional handcrafted methods, it applies state-of-the-art deep learning and explainable AI techniques to enhance diagnostic accuracy and transparency. Tested on the Saliency4ASD dataset, the proposed ViT-Mamba model outperformed existing methods, achieving 0.96 accuracy, 0.95 F1-score, 0.97 sensitivity, and 0.94 specificity. These findings show the model's promise for scalable, interpretable ASD screening, especially in resource-constrained or remote clinical settings where access to expert diagnosis is limited.
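The four reported metrics all derive from the binary confusion matrix. The sketch below shows how they relate; the function name `screening_metrics` and the label convention (1 = ASD, 0 = typically developing) are assumptions made for illustration, and the toy labels in the usage note are not data from the paper.

```python
def screening_metrics(y_true, y_pred):
    """Return accuracy, F1, sensitivity, and specificity for binary labels.

    Assumed convention: 1 marks the positive (ASD) class, 0 the negative class.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Sensitivity (recall): share of true ASD cases that were detected.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: share of non-ASD cases correctly cleared.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0

    return {
        "accuracy": accuracy,
        "f1": f1,
        "sensitivity": sensitivity,
        "specificity": specificity,
    }
```

For example, `screening_metrics([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0, 0, 1])` gives 0.75 for every metric, since the toy predictions contain three true positives, three true negatives, one false positive, and one false negative. In a screening context, sensitivity is usually the metric to watch, because a missed case delays intervention.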
Problem

Research questions and friction points this paper is trying to address.

Hybrid deep learning for ASD diagnosis via eye-tracking
Integrating visual, speech, and facial cues with attention fusion
Enhancing diagnostic accuracy and transparency in clinical settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid Vision Transformer-Mamba for ASD diagnosis
Attention-based fusion of visual, speech, facial cues
Explainable AI enhances diagnostic accuracy and transparency
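The summary does not spell out how the attention-based fusion is built, so the following is only a generic sketch of the idea: each modality contributes an embedding, a query vector scores each one with scaled dot-product attention, and the fused representation is the attention-weighted sum. The names `attention_fuse` and `query`, and the use of a single query vector, are assumptions for illustration; in a trained model the query and embeddings would be learned.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(embeddings, query):
    """Fuse per-modality embeddings with scaled dot-product attention.

    embeddings: dict mapping a modality name (e.g. "visual", "speech",
        "facial") to a list of floats; all lists share the query's length.
    query: list of floats (hypothetical; would be learned in practice).
    Returns (fused_vector, weights_by_modality).
    """
    names = list(embeddings)
    dim = len(query)

    # One relevance score per modality.
    scores = [
        sum(q * x for q, x in zip(query, embeddings[name])) / math.sqrt(dim)
        for name in names
    ]
    weights = softmax(scores)

    # Weighted sum of the modality embeddings.
    fused = [
        sum(w * embeddings[name][i] for w, name in zip(weights, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, weights))
```

The returned weights sum to 1 and can be inspected to see how much each modality contributed to a given prediction, which complements the heatmap-style explanations mentioned above.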
Wafaa Kasri
Faculty of Science and Technology, Tissemsilt University, Bougara 38000, Algeria
Yassine Himeur
College of Engineering and Information Technology, University of Dubai, Dubai, UAE
Abigail Copiaco
College of Engineering and Information Technology, University of Dubai, Dubai, UAE
W. Mansoor
College of Engineering and Information Technology, University of Dubai, Dubai, UAE
Ammar Albanna
College of Medicine and Health Sciences, Mohammed Bin Rashid University, Dubai, UAE
Valsamma Eapen
UNSW