MindSET: Advancing Mental Health Benchmarking through Large-Scale Social Media Data

📅 2025-11-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing mental health analysis benchmarks suffer from limited scale, insufficient data curation, inadequate multilingual coverage, and poor representation of harmful content. To address these limitations, the authors introduce MindSET, a large-scale, high-quality mental health benchmark derived from Reddit, comprising over 13 million self-reported, diagnosis-annotated posts across seven psychiatric disorders. The curation pipeline integrates LIWC-based linguistic analysis, NSFW content filtering, language identification, and strict deduplication. MindSET standardizes large-scale mental health data with controlled handling of multilingual and harmful content, significantly expanding disorder coverage and scenario diversity. Empirical evaluation demonstrates substantial improvements: on autism detection, models trained on MindSET achieve up to an 18-point gain in F1 over prior benchmarks. These results support MindSET's effectiveness, robustness, and generalizability for mental health NLP research.

📝 Abstract
Social media data has become a vital resource for studying mental health, offering real-time insights into thoughts, emotions, and behaviors that traditional methods often miss. Progress in this area has been facilitated by benchmark datasets for mental health analysis; however, most existing benchmarks have become outdated due to limited data availability, inadequate cleaning, and the inherently diverse nature of social media content (e.g., multilingual and harmful material). We present a new benchmark dataset, MindSET, curated from Reddit using self-reported diagnoses to address these limitations. The dataset contains over 13M annotated posts across seven mental health conditions, more than twice the size of previous benchmarks. To ensure data quality, we applied rigorous preprocessing steps, including language filtering and removal of Not Safe for Work (NSFW) and duplicate content. We further performed a linguistic analysis using LIWC to examine psychological term frequencies across the eight groups represented in the dataset. To demonstrate the dataset's utility, we conducted binary classification experiments for diagnosis detection using both fine-tuned language models and Bag-of-Words (BoW) features. Models trained on MindSET consistently outperformed those trained on previous benchmarks, achieving up to an 18-point improvement in F1 for autism detection. Overall, MindSET provides a robust foundation for researchers exploring the intersection of social media and mental health, supporting both early risk detection and deeper analysis of emerging psychological trends.
Problem

Research questions and friction points this paper is trying to address.

Outdated mental health benchmarks due to limited data and poor cleaning
Lack of large-scale social media datasets for mental health analysis
Inadequate handling of diverse content like multilingual and harmful material
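The curation problems listed above (language filtering, harmful-content removal, deduplication) can be illustrated with a minimal, self-contained sketch. This is not the authors' pipeline: the keyword-based NSFW check and the `lang` field are illustrative stand-ins for the language-identification models and Reddit NSFW flags a real pipeline would use.

```python
import hashlib

def curate(posts, allowed_lang="en"):
    """Toy curation pass mirroring the pipeline stages described in the
    paper (language filtering, NSFW removal, strict deduplication).
    All heuristics here are illustrative stand-ins, not the authors' code."""
    nsfw_terms = {"nsfw"}  # placeholder list; real filters are far richer
    seen_hashes = set()
    kept = []
    for post in posts:
        text = post["text"].strip()
        # Language filtering: assumes an upstream language-ID step
        # has already tagged each post with a "lang" field.
        if post.get("lang", allowed_lang) != allowed_lang:
            continue
        # NSFW filtering by keyword (stand-in for subreddit/post flags).
        if any(term in text.lower() for term in nsfw_terms):
            continue
        # Strict deduplication on a hash of the normalized text.
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        kept.append(post)
    return kept
```

Hashing normalized text catches exact duplicates cheaply; near-duplicate detection would need shingling or MinHash on top of this.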
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale Reddit dataset with self-reported diagnoses
Rigorous preprocessing for language filtering and content safety
Binary classification experiments with fine-tuned language models
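The BoW baseline mentioned above can be sketched as a tiny Bag-of-Words classifier for diagnosis-vs-control posts. The paper does not specify its tokenizer or classifier, so this stdlib-only multinomial Naive Bayes with whitespace tokenization and Laplace smoothing is an assumption-laden illustration, not the authors' implementation.

```python
import math
from collections import Counter

class BowNaiveBayes:
    """Minimal Bag-of-Words multinomial Naive Bayes for binary
    diagnosis-detection experiments. Illustrative sketch only."""

    def fit(self, texts, labels):
        # Per-class token counts and class priors.
        self.counts = {0: Counter(), 1: Counter()}
        self.priors = Counter(labels)
        for text, y in zip(texts, labels):
            self.counts[y].update(text.lower().split())
        self.vocab = set(self.counts[0]) | set(self.counts[1])
        self.totals = {y: sum(c.values()) for y, c in self.counts.items()}
        return self

    def predict(self, text):
        tokens = text.lower().split()
        n = sum(self.priors.values())
        scores = {}
        for y in (0, 1):
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(self.priors[y] / n)
            for tok in tokens:
                score += math.log(
                    (self.counts[y][tok] + 1)
                    / (self.totals[y] + len(self.vocab))
                )
            scores[y] = score
        return max(scores, key=scores.get)
```

A fine-tuned language model would replace these count-based features with contextual representations, which is where the reported F1 gains on MindSET come from.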
Saad Mankarious
School of Engineering and Applied Science, George Washington University, Washington, D.C. 20037
Ayah Zirikly
Assistant Professor, George Washington University, Johns Hopkins University
Natural Language Processing · Mental Health · Clinical NLP · BioNLP · Arabic NLP
Daniel Wiechmann
Institute for Logic, Language & Computation, University of Amsterdam, the Netherlands
E. Kerz
Exaia Technologies, Germany
E. Kempa
Department of Computer and Information Science and Engineering, University of Florida, USA
Yu Qiao
Exaia Technologies, Germany