🤖 AI Summary
This study addresses critical challenges—data scarcity, linguistic complexity, and translation-induced distortion—confronting large language models (LLMs) in Arabic mental health applications. We conduct the first systematic, multi-task evaluation of eight LLMs across depression detection, suicide risk assessment, and mental health question answering. Methodologically, we employ structured prompt engineering, few-shot learning, and comparative analysis of native Arabic versus English–Arabic translation variants, evaluating performance using balanced accuracy and mean absolute error (MAE). Key findings include: (1) structured prompting improves multiclass accuracy by 14.5% on average; (2) few-shot learning boosts GPT-4o Mini’s performance by 1.58×; and (3) Phi-3.5 MoE achieves top binary classification accuracy, while Mistral NeMo attains the lowest MAE in severity prediction—demonstrating complementary model strengths. Our work establishes a reproducible evaluation framework and optimization paradigm for Arabic-language AI in mental health.
📝 Abstract
Mental health disorders pose a growing public health concern in the Arab world, emphasizing the need for accessible diagnostic and intervention tools. Large language models (LLMs) offer a promising approach, but their application in Arabic contexts faces challenges including limited labeled datasets, linguistic complexity, and translation biases. This study comprehensively evaluates eight LLMs, spanning general multilingual models as well as bilingual ones, on diverse mental health datasets (such as AraDepSu, Dreaddit, MedMCQA), investigating the impact of prompt design, language configuration (native Arabic vs. translated English, and vice versa), and few-shot prompting on diagnostic performance. We find that prompt engineering significantly influences LLM scores, largely through its effect on instruction following: our structured prompt outperforms a less structured variant on multi-class datasets by an average of 14.5%. While the influence of language on performance was modest, model selection proved crucial: Phi-3.5 MoE excelled in balanced accuracy, particularly for binary classification, while Mistral NeMo achieved superior mean absolute error on severity prediction tasks. Few-shot prompting consistently improved performance, with particularly substantial gains for GPT-4o Mini on multi-class classification, boosting accuracy by an average factor of 1.58. These findings underscore the importance of prompt optimization, multilingual analysis, and few-shot learning for developing culturally sensitive and effective LLM-based mental health tools for Arabic-speaking populations.
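The two metrics the study reports, balanced accuracy for classification and mean absolute error (MAE) for severity prediction, can be sketched directly from their definitions. The snippet below is illustrative only (it is not the paper's evaluation code), and the severity labels are hypothetical:

```python
# Illustrative sketch (not from the study's code): balanced accuracy and MAE,
# implemented from their standard definitions on toy severity labels.

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall; robust to class imbalance."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(classes)

def mean_absolute_error(y_true, y_pred):
    """Average absolute distance between predicted and true severity levels."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical gold vs. predicted depression-severity labels (0 = none .. 3 = severe)
y_true = [0, 0, 1, 2, 3, 3, 1, 2]
y_pred = [0, 1, 1, 2, 3, 2, 1, 3]

print(balanced_accuracy(y_true, y_pred))   # 0.625
print(mean_absolute_error(y_true, y_pred))  # 0.375
```

Balanced accuracy averages recall over classes rather than over samples, which matters here because mental health datasets are typically imbalanced toward the negative class; MAE, by contrast, rewards predictions that are close on the severity scale even when not exact.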