CARMA: Comprehensive Automatically-annotated Reddit Mental Health Dataset for Arabic

📅 2025-11-05

📈 Citations: 0

✨ Influential: 0

career value

177K/year

🤖 AI Summary

To address the scarcity of annotated Arabic mental health data—which hinders early detection research—this paper introduces ArabMHD, the first large-scale automatically annotated Arabic Reddit dataset for mental health. It covers six psychiatric conditions (e.g., anxiety, autism, depression) and a healthy control group. ArabMHD is constructed via a multi-stage NLP pipeline that integrates shallow classifiers with large language model–based label verification to ensure annotation quality. Compared to existing resources, it substantially surpasses them in scale (>1 million posts), thematic coverage, and user demographic diversity. Empirical evaluation demonstrates that ArabMHD enables robust multi-class mental state identification: it improves classification accuracy for Arabic mental health states under both zero-shot and supervised learning settings. By providing high-quality, scalable, and clinically informed annotations, ArabMHD fills a critical gap in computational mental health research for low-resource languages.

Technology Category

Application Category

📝 Abstract

Mental health disorders affect millions worldwide, yet early detection remains a major challenge, particularly for Arabic-speaking populations where resources are limited and mental health discourse is often discouraged due to cultural stigma. While substantial research has focused on English-language mental health detection, Arabic remains significantly underexplored, partly due to the scarcity of annotated datasets. We present CARMA, the first automatically annotated large-scale dataset of Arabic Reddit posts. The dataset encompasses six mental health conditions, such as Anxiety, Autism, and Depression, and a control group. CARMA surpasses existing resources in both scale and diversity. We conduct qualitative and quantitative analyses of lexical and semantic differences between users, providing insights into the linguistic markers of specific mental health conditions. To demonstrate the dataset's potential for further mental health analysis, we perform classification experiments using a range of models, from shallow classifiers to large language models. Our results highlight the promise of advancing mental health detection in underrepresented languages such as Arabic.

Problem

Research questions and friction points this paper is trying to address.

Addressing limited mental health resources for Arabic-speaking populations

Overcoming scarcity of annotated datasets for Arabic mental health detection

Advancing early detection of mental disorders in underrepresented languages

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatically annotated large-scale Arabic Reddit dataset

Qualitative and quantitative lexical and semantic analyses

Classification experiments with shallow and large language models

🔎 Similar Papers

No similar papers found.