🤖 AI Summary
To address the limited accessibility of clinical Autism Spectrum Disorder (ASD) diagnosis in resource-constrained regions, this paper proposes a lightweight multimodal automated diagnostic method based on eye-tracking trajectories. The method introduces a novel ViT-Mamba hybrid architecture and, for the first time, incorporates a cross-architecture attention fusion mechanism to jointly model visual, vocal, and facial dynamic cues, capturing both spatial representations and long-range temporal dependencies. To enhance clinical interpretability and trustworthiness, Grad-CAM-based explainable AI is integrated, and the end-to-end design eliminates reliance on handcrafted feature engineering. Evaluated on the Saliency4ASD dataset, the approach achieves 96% accuracy, 95% F1-score, 97% sensitivity, and 94% specificity, substantially outperforming state-of-the-art methods. Its lightweight design enables efficient remote screening, offering a scalable solution for early ASD detection in underserved settings.
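For a concrete picture of the kind of hybrid the summary describes, below is a minimal PyTorch sketch: a ViT-style spatial encoder, a temporal branch for gaze trajectories, and a cross-architecture attention fusion step. All module names, dimensions, and the two-stream layout are illustrative assumptions; a GRU stands in for the actual Mamba selective-SSM block (e.g., from the `mamba_ssm` package), and only the visual and gaze streams are shown, whereas the paper additionally fuses speech and facial cues.

```python
# Illustrative sketch of a ViT-Mamba-style hybrid with cross-attention fusion.
# Module names and hyperparameters are assumptions, not the paper's architecture.
import torch
import torch.nn as nn

class PatchViT(nn.Module):
    """Tiny ViT-style spatial encoder over stimulus/saliency images."""
    def __init__(self, img_size=224, patch=16, dim=128, depth=2, heads=4):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        n_tokens = (img_size // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, x):                       # x: (B, 3, H, W)
        tokens = self.embed(x).flatten(2).transpose(1, 2) + self.pos
        return self.encoder(tokens)             # (B, N, dim)

class TemporalSSM(nn.Module):
    """Stand-in for a Mamba block: captures long-range temporal dependencies."""
    def __init__(self, in_dim, dim=128):
        super().__init__()
        # A GRU substitutes for the selective state-space layer in this sketch.
        self.rnn = nn.GRU(in_dim, dim, batch_first=True)

    def forward(self, x):                       # x: (B, T, in_dim)
        out, _ = self.rnn(x)
        return out                              # (B, T, dim)

class CrossAttnFusion(nn.Module):
    """Cross-architecture attention: temporal tokens attend to spatial tokens."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, q_seq, kv_seq):
        fused, _ = self.attn(q_seq, kv_seq, kv_seq)
        return self.norm(q_seq + fused)         # residual + norm

class ViTMambaASD(nn.Module):
    def __init__(self, gaze_dim=2, dim=128, n_classes=2):
        super().__init__()
        self.spatial = PatchViT(dim=dim)
        self.temporal = TemporalSSM(gaze_dim, dim)
        self.fusion = CrossAttnFusion(dim)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, image, gaze_seq):
        s = self.spatial(image)                 # spatial tokens from ViT branch
        t = self.temporal(gaze_seq)             # temporal tokens from SSM branch
        fused = self.fusion(t, s)               # temporal queries, spatial keys
        return self.head(fused.mean(dim=1))     # pooled classification logits

model = ViTMambaASD()
logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 50, 2))
```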
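The Grad-CAM integration can be sketched in the same spirit. Which layer the paper actually visualizes is not stated here, so this example hooks the patch-embedding convolution of the hypothetical model above; the weighting scheme is standard Grad-CAM (gradient-averaged channel weights over the spatial activation map).

```python
# Illustrative Grad-CAM over the patch-embedding conv of the sketch model.
# The choice of target layer is an assumption for demonstration purposes.
import torch

def grad_cam(model, image, gaze_seq, target_class):
    feats, grads = {}, {}
    layer = model.spatial.embed                  # conv patch embedding
    h1 = layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))
    logits = model(image, gaze_seq)
    logits[0, target_class].backward()           # gradients w.r.t. target logit
    h1.remove()
    h2.remove()
    a, g = feats["a"], grads["a"]                # both (B, C, h, w)
    weights = g.mean(dim=(2, 3), keepdim=True)   # per-channel importance
    cam = torch.relu((weights * a).sum(dim=1))   # (B, h, w) heatmap
    return cam / (cam.max() + 1e-8)              # normalize to [0, 1]
```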
📝 Abstract
Accurate Autism Spectrum Disorder (ASD) diagnosis is vital for early intervention. This study presents a hybrid deep learning framework combining Vision Transformers (ViT) and Vision Mamba to detect ASD using eye-tracking data. The model uses attention-based fusion to integrate visual, speech, and facial cues, capturing both spatial and temporal dynamics. Unlike traditional approaches that rely on handcrafted features, it applies state-of-the-art deep learning and explainable AI techniques to enhance diagnostic accuracy and transparency. Tested on the Saliency4ASD dataset, the proposed ViT-Mamba model outperformed existing methods, achieving 0.96 accuracy, 0.95 F1-score, 0.97 sensitivity, and 0.94 specificity. These findings show the model's promise for scalable, interpretable ASD screening, especially in resource-constrained or remote clinical settings where access to expert diagnosis is limited.
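As a quick reference for how these four reported numbers relate to one another, the sketch below derives them from a binary confusion matrix. The counts are a hypothetical 100-subject test split chosen for illustration, not the paper's actual results.

```python
# Hedged sketch: the four screening metrics as functions of confusion-matrix
# counts. tp/fp/tn/fn values below are hypothetical, not from the paper.

def screening_metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)          # recall on ASD-positive subjects
    specificity = tn / (tn + fp)          # recall on typically-developing subjects
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, f1, sensitivity, specificity

# Hypothetical 100-subject split: 50 ASD, 50 typically developing.
print(screening_metrics(tp=48, fp=3, tn=47, fn=2))
# -> accuracy 0.95, F1 ~0.95, sensitivity 0.96, specificity 0.94
```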