🤖 AI Summary
This study addresses critical challenges—data scarcity, linguistic complexity, and translation-induced distortion—confronting large language models (LLMs) in Arabic mental health applications. We conduct the first systematic, multi-task evaluation of eight LLMs across depression detection, suicide risk assessment, and mental health question answering. Methodologically, we employ structured prompt engineering, few-shot learning, and comparative analysis of native Arabic versus English–Arabic translation variants, evaluating performance using balanced accuracy and mean absolute error (MAE). Key findings include: (1) structured prompting improves multiclass accuracy by 14.5% on average; (2) few-shot learning boosts GPT-4o Mini’s performance by 1.58×; and (3) Phi-3.5 MoE achieves top binary classification accuracy, while Mistral NeMo attains the lowest MAE in severity prediction—demonstrating complementary model strengths. Our work establishes a reproducible evaluation framework and optimization paradigm for Arabic-language AI in mental health.
📝 Abstract
Mental health disorders pose a growing public health concern in the Arab world, emphasizing the need for accessible diagnostic and intervention tools. Large language models (LLMs) offer a promising approach, but their application in Arabic contexts faces challenges including limited labeled datasets, linguistic complexity, and translation biases. This study comprehensively evaluates eight LLMs, spanning general multilingual models as well as bilingual ones, on diverse mental health datasets (such as AraDepSu, Dreaddit, MedMCQA), investigating the impact of prompt design, language configuration (native Arabic vs. translated English, and vice versa), and few-shot prompting on diagnostic performance. We find that prompt engineering significantly influences LLM scores, largely through its effect on instruction following: our structured prompt outperforms a less structured variant on multi-class datasets by an average of 14.5%. While the influence of language on performance was modest, model selection proved crucial: Phi-3.5 MoE excelled in balanced accuracy, particularly for binary classification, while Mistral NeMo achieved superior mean absolute error on severity prediction tasks. Few-shot prompting consistently improved performance, with particularly substantial gains for GPT-4o Mini on multi-class classification, boosting accuracy by an average factor of 1.58. These findings underscore the importance of prompt optimization, multilingual analysis, and few-shot learning for developing culturally sensitive and effective LLM-based mental health tools for Arabic-speaking populations.
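The two metrics the study reports, balanced accuracy for classification and mean absolute error (MAE) for severity prediction, can be sketched directly from their definitions. The snippet below is illustrative only (it is not the paper's evaluation code), and the severity labels are hypothetical:

```python
# Illustrative sketch (not from the study's code): balanced accuracy and MAE,
# implemented from their standard definitions on toy severity labels.

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall; robust to class imbalance."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(classes)

def mean_absolute_error(y_true, y_pred):
    """Average absolute distance between predicted and true severity levels."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical gold vs. predicted depression-severity labels (0 = none .. 3 = severe)
y_true = [0, 0, 1, 2, 3, 3, 1, 2]
y_pred = [0, 1, 1, 2, 3, 2, 1, 3]

print(balanced_accuracy(y_true, y_pred))   # 0.625
print(mean_absolute_error(y_true, y_pred))  # 0.375
```

Balanced accuracy averages recall over classes rather than over samples, which matters here because mental health datasets are typically imbalanced toward the negative class; MAE, by contrast, rewards predictions that are close on the severity scale even when not exact.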