A Comprehensive Evaluation of Large Language Models on Mental Illnesses in Arabic Context

📅 2025-01-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses critical challenges—data scarcity, linguistic complexity, and translation-induced distortion—confronting large language models (LLMs) in Arabic mental health applications. We conduct the first systematic, multi-task evaluation of eight LLMs across depression detection, suicide risk assessment, and mental health question answering. Methodologically, we employ structured prompt engineering, few-shot learning, and comparative analysis of native Arabic versus English–Arabic translation variants, evaluating performance using balanced accuracy and mean absolute error (MAE). Key findings include: (1) structured prompting improves multiclass accuracy by 14.5% on average; (2) few-shot learning boosts GPT-4o Mini’s performance by 1.58×; and (3) Phi-3.5 MoE achieves top binary classification accuracy, while Mistral NeMo attains the lowest MAE in severity prediction—demonstrating complementary model strengths. Our work establishes a reproducible evaluation framework and optimization paradigm for Arabic-language AI in mental health.
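The structured prompting and few-shot setup described in the summary can be sketched as follows; the label set, template wording, and example posts are illustrative assumptions, not the paper's actual prompts or data.

```python
# Sketch of a structured few-shot prompt for multi-class severity
# classification. Labels and demonstration texts are hypothetical
# placeholders, not drawn from AraDepSu or the paper's prompt set.
SEVERITY_LABELS = ["none", "mild", "moderate", "severe"]

def build_prompt(post: str, shots: list[tuple[str, str]]) -> str:
    """Assemble a structured prompt: task description, the allowed
    label set, optional few-shot demonstrations, then the target post."""
    lines = [
        "Task: classify the depression severity expressed in the post.",
        "Answer with exactly one label from: "
        + ", ".join(SEVERITY_LABELS) + ".",
    ]
    for example_text, example_label in shots:  # few-shot demonstrations
        lines.append(f"Post: {example_text}\nLabel: {example_label}")
    lines.append(f"Post: {post}\nLabel:")
    return "\n\n".join(lines)
```

Constraining the answer to an explicit label set is one plausible way structured prompting reduces the instruction-following failures the abstract mentions, since free-form answers are harder to score.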

📝 Abstract
Mental health disorders pose a growing public health concern in the Arab world, underscoring the need for accessible diagnostic and intervention tools. Large language models (LLMs) offer a promising approach, but their application in Arabic contexts faces challenges including limited labeled datasets, linguistic complexity, and translation biases. This study comprehensively evaluates eight LLMs, both general multilingual and bilingual models, on diverse mental health datasets (such as AraDepSu, Dreaddit, and MedMCQA), investigating the impact of prompt design, language configuration (native Arabic vs. translated English, and vice versa), and few-shot prompting on diagnostic performance. We find that prompt engineering significantly influences LLM scores, mainly through reduced instruction following: our structured prompt outperforms a less structured variant on multi-class datasets by an average of 14.5%. While the influence of language on performance was modest, model selection proved crucial: Phi-3.5 MoE excelled in balanced accuracy, particularly for binary classification, while Mistral NeMo achieved superior mean absolute error on severity prediction tasks. Few-shot prompting consistently improved performance, with particularly substantial gains for GPT-4o Mini on multi-class classification, boosting accuracy by an average factor of 1.58. These findings underscore the importance of prompt optimization, multilingual analysis, and few-shot learning for developing culturally sensitive and effective LLM-based mental health tools for Arabic-speaking populations.
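The two evaluation metrics the abstract relies on, balanced accuracy for (possibly imbalanced) classification and mean absolute error for ordinal severity prediction, can be computed as below; a minimal pure-Python sketch with made-up predictions, not the paper's evaluation code.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls; unlike plain accuracy, it is not
    dominated by the majority class in an imbalanced dataset."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between true and predicted severity levels,
    so predicting 'mild' for 'moderate' costs less than 'severe' for 'none'."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

MAE suits severity prediction because the labels are ordered: it rewards a model that is off by one level over one that is off by three, a distinction balanced accuracy cannot make.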
Problem

Research questions and friction points this paper is trying to address.

Arabic mental health data
large language models
diagnosis and treatment assistance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Arabic Mental Health Assessment
Large Language Models
Optimized Prompt Design
Noureldin Zahran
Compumacy for Artificial Intelligence solutions, Cairo, Egypt
Aya E. Fouda
Compumacy for Artificial Intelligence solutions, Cairo, Egypt
Radwa J. Hanafy
Compumacy for Artificial Intelligence solutions, Cairo, Egypt; Department of Behavioural Health, Saint Elizabeths Hospital, Washington DC, 20032
Mohammed E. Fouda
Unknown affiliation