Foundation Model-based Evaluation of Neuropsychiatric Disorders: A Lifespan-Inclusive, Multi-Modal, and Multi-Lingual Study

📅 2025-12-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Language and voice abnormalities in neuropsychiatric disorders such as Alzheimer's disease (AD), depression, and autism spectrum disorder (ASD) are promising early biomarkers. However, existing multimodal approaches generalize poorly across languages and lack consistent evaluation protocols. To address these limitations, we propose FEND, a unified assessment framework that spans the full lifespan, integrates speech and text modalities, and supports five languages (English, Chinese, Greek, French, Dutch). We introduce the first benchmark comprising 13 datasets across all three disorders and five languages. FEND incorporates foundation-model-based cross-lingual speech-text joint representation learning, standardized cross-dataset evaluation protocols, and a novel modality importance quantification method. Experiments show that multimodal fusion outperforms unimodal baselines in AD and depression detection, whereas the weaker ASD performance is primarily attributable to dataset heterogeneity. Cross-lingual tasks achieve >86% accuracy under homogeneous conditions but drop by 12–19% in heterogeneous settings.

📝 Abstract
Neuropsychiatric disorders, such as Alzheimer's disease (AD), depression, and autism spectrum disorder (ASD), are characterized by linguistic and acoustic abnormalities, offering potential biomarkers for early detection. Despite the promise of multi-modal approaches, challenges like multi-lingual generalization and the absence of a unified evaluation framework persist. To address these gaps, we propose FEND (Foundation model-based Evaluation of Neuropsychiatric Disorders), a comprehensive multi-modal framework integrating speech and text modalities for detecting AD, depression, and ASD across the lifespan. Leveraging 13 multi-lingual datasets spanning English, Chinese, Greek, French, and Dutch, we systematically evaluate multi-modal fusion performance. Our results show that multi-modal fusion excels in AD and depression detection but underperforms in ASD due to dataset heterogeneity. We also identify modality imbalance as a prevalent issue, where multi-modal fusion fails to surpass the best mono-modal models. Cross-corpus experiments reveal robust performance in task- and language-consistent scenarios but noticeable degradation in multi-lingual and task-heterogeneous settings. By providing extensive benchmarks and a detailed analysis of performance-influencing factors, FEND advances the field of automated, lifespan-inclusive, and multi-lingual neuropsychiatric disorder assessment. We encourage researchers to adopt the FEND framework for fair comparisons and reproducible research.
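The abstract does not specify how FEND combines the two modalities. As a minimal sketch, assuming simple feature-level (early) fusion by concatenation, the snippet below L2-normalizes per-utterance speech and text embeddings before joining them. It also expresses the modality-imbalance criterion from the abstract (fusion failing to surpass the best mono-modal model) as an explicit check. The embedding sizes (768 and 1024), the random arrays standing in for foundation-model outputs, and the accuracy values are hypothetical placeholders, not results from the paper. The names `early_fusion` and `has_modality_imbalance` are illustrative only.

```python
import numpy as np


def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit L2 norm (eps guards against division by zero)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def early_fusion(speech_emb: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Hypothetical feature-level fusion: normalize each modality, then concatenate."""
    if speech_emb.shape[0] != text_emb.shape[0]:
        raise ValueError("speech and text embeddings need one row per utterance")
    return np.concatenate([l2_normalize(speech_emb), l2_normalize(text_emb)], axis=1)


def has_modality_imbalance(scores: dict) -> bool:
    """True when the fused model does not beat the best single-modality model."""
    best_mono = max(scores["speech"], scores["text"])
    return scores["fusion"] <= best_mono


# Random stand-ins for foundation-model embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
speech = rng.normal(size=(4, 768))
text = rng.normal(size=(4, 1024))
fused = early_fusion(speech, text)
print(fused.shape)  # (4, 1792)

# Placeholder accuracies, not results from the paper.
print(has_modality_imbalance({"speech": 0.71, "text": 0.78, "fusion": 0.76}))  # True
```

Normalizing each modality first keeps one encoder's larger-magnitude features from dominating the concatenated vector. It does not by itself prevent modality imbalance, which is why the check compares the fused score against both single-modality scores.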
Problem

Research questions and friction points this paper is trying to address.

Develops a multi-modal framework for detecting neuropsychiatric disorders across the lifespan
Addresses challenges in multi-lingual generalization and unified evaluation for disorder assessment
Evaluates performance of speech-text fusion across diverse languages and datasets
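The cross-corpus experiments distinguish task- and language-consistent train/test pairs from multi-lingual and task-heterogeneous ones. A minimal sketch of how such pairs could be enumerated and labelled is shown below. The corpus names are hypothetical. Pairing only corpora annotated for the same disorder is an assumption, and `task` is assumed to mean the speech-elicitation task (for example picture description versus interview), which the abstract does not define.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Corpus:
    name: str      # hypothetical identifier, not a real dataset name
    disorder: str  # "AD", "depression" or "ASD"
    language: str  # "en", "zh", "el", "fr" or "nl"
    task: str      # assumed: speech-elicitation task, e.g. "picture", "interview"


def cross_corpus_pairs(corpora):
    """Yield (train, test, setting) for every ordered pair of distinct corpora."""
    for train, test in permutations(corpora, 2):
        if train.disorder != test.disorder:
            continue  # assumed: only pair corpora annotated for the same disorder
        same_lang = train.language == test.language
        same_task = train.task == test.task
        if same_lang and same_task:
            setting = "consistent"
        elif same_task:
            setting = "cross-lingual"
        elif same_lang:
            setting = "cross-task"
        else:
            setting = "cross-lingual and cross-task"
        yield train.name, test.name, setting


corpora = [
    Corpus("ad_en_a", "AD", "en", "picture"),
    Corpus("ad_en_b", "AD", "en", "picture"),
    Corpus("ad_zh_a", "AD", "zh", "picture"),
    Corpus("dep_en_a", "depression", "en", "interview"),
]
for train_name, test_name, setting in cross_corpus_pairs(corpora):
    print(f"{train_name} -> {test_name}: {setting}")
```

With these four placeholder corpora, the three AD corpora yield two consistent pairs and four cross-lingual pairs, while the single depression corpus has no partner and is skipped.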
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal framework integrating speech and text
Leveraging 13 multi-lingual datasets for evaluation
Extensive benchmarks for automated lifespan-inclusive assessment
Zhongren Dong
College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082, China
Haotian Guo
College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082, China
Weixiang Xu
College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082, China
Huan Zhao
College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082, China
Zixing Zhang
Professor, Hunan University
Artificial Intelligence, Speech Processing, Affective Computing, Digital Health, Automatic Speech Recognition