🤖 AI Summary
Existing mental health analysis benchmarks suffer from limited scale, insufficient data curation, inadequate multilingual handling, and poor treatment of harmful content. To address these limitations, we introduce MindSET, a large-scale, high-quality mental health benchmark curated from Reddit, comprising over 13 million posts annotated via self-reported diagnoses across seven mental health conditions, more than twice the size of previous benchmarks. Our curation pipeline integrates language filtering, Not Safe for Work (NSFW) content removal, strict deduplication, and LIWC-based linguistic analysis of psychological term frequencies across the dataset's eight groups. Empirical evaluation demonstrates substantial improvements: models trained on MindSET consistently outperform those trained on prior benchmarks, achieving up to an 18-point gain in F1 for autism detection. These results support MindSET's utility for mental health NLP research, from early risk detection to analysis of emerging psychological trends.
📝 Abstract
Social media data has become a vital resource for studying mental health, offering real-time insights into thoughts, emotions, and behaviors that traditional methods often miss. Progress in this area has been facilitated by benchmark datasets for mental health analysis; however, most existing benchmarks have become outdated due to limited data availability, inadequate cleaning, and the inherently diverse nature of social media content (e.g., multilingual and harmful material). To address these limitations, we present a new benchmark dataset, **MindSET**, curated from Reddit using self-reported diagnoses. The dataset contains over **13M** annotated posts across seven mental health conditions, more than twice the size of previous benchmarks. To ensure data quality, we applied rigorous preprocessing steps, including language filtering and removal of Not Safe for Work (NSFW) and duplicate content. We further performed a linguistic analysis using LIWC to examine psychological term frequencies across the eight groups represented in the dataset. To demonstrate the dataset's utility, we conducted binary classification experiments for diagnosis detection using both fine-tuned language models and Bag-of-Words (BoW) features. Models trained on MindSET consistently outperformed those trained on previous benchmarks, achieving up to an **18-point** improvement in F1 for autism detection. Overall, MindSET provides a robust foundation for researchers exploring the intersection of social media and mental health, supporting both early risk detection and deeper analysis of emerging psychological trends.
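The deduplication step mentioned in the preprocessing pipeline can be sketched as exact-match hashing over normalized text. This is a minimal illustration using only the Python standard library; the paper's actual deduplication criteria (e.g., near-duplicate handling) are not specified here, so the normalization rules below are assumptions.

```python
import hashlib

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivially reformatted
    # reposts hash to the same digest (an assumed normalization).
    return " ".join(text.lower().split())

def deduplicate(posts: list[str]) -> list[str]:
    """Keep the first occurrence of each normalized post, drop exact repeats."""
    seen: set[str] = set()
    kept: list[str] = []
    for post in posts:
        digest = hashlib.sha256(normalize(post).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(post)
    return kept

# Toy stand-ins for Reddit posts; the real benchmark has millions.
posts = [
    "Feeling overwhelmed today.",
    "feeling   overwhelmed TODAY.",  # duplicate up to case/whitespace
    "Started therapy this week.",
]
unique_posts = deduplicate(posts)  # keeps 2 of the 3 posts
```

Hashing digests rather than storing full normalized strings keeps memory bounded when scanning millions of posts.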
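The BoW-based diagnosis-detection experiments could look roughly like the following scikit-learn sketch: Bag-of-Words features feeding a linear classifier for a binary diagnosis-vs-control task. The toy posts, labels, and model choice (logistic regression) are illustrative assumptions, not the paper's exact setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in posts; the real experiments train on millions of
# diagnosis-annotated Reddit posts.
posts = [
    "i was diagnosed with autism last year and it explained so much",
    "my sensory overload gets worse in crowded places",
    "routines help me cope with unexpected changes",
    "just finished a great hike this weekend",
    "trying a new pasta recipe tonight",
    "the football match last night was amazing",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = diagnosis group, 0 = control

# Bag-of-Words (unigrams + bigrams) into a linear classifier,
# one common baseline for binary diagnosis detection.
clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
clf.fit(posts, labels)

pred = clf.predict(["crowds give me sensory overload"])[0]
```

In practice one would evaluate with a held-out split and report F1, as the abstract does; this sketch only shows the feature-to-classifier wiring.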