🤖 AI Summary
Existing mental health analysis benchmarks suffer from limited scale, insufficient data curation, inadequate multilingual handling, and poor treatment of harmful content. To address these limitations, we introduce MindSET, a large-scale, high-quality mental health benchmark curated from Reddit, comprising over 13 million posts annotated via self-reported diagnoses across seven mental health conditions, more than twice the size of previous benchmarks. Our curation pipeline integrates language filtering, Not Safe for Work (NSFW) content removal, strict deduplication, and LIWC-based linguistic analysis of psychological term frequencies across the dataset's eight groups. Empirical evaluation demonstrates substantial improvements: models trained on MindSET consistently outperform those trained on prior benchmarks, achieving up to an 18-point gain in F1 for autism detection. These results support MindSET's utility for mental health NLP research, from early risk detection to analysis of emerging psychological trends.
📝 Abstract
Social media data has become a vital resource for studying mental health, offering real-time insights into thoughts, emotions, and behaviors that traditional methods often miss. Progress in this area has been facilitated by benchmark datasets for mental health analysis; however, most existing benchmarks have become outdated due to limited data availability, inadequate cleaning, and the inherently diverse nature of social media content (e.g., multilingual and harmful material). To address these limitations, we present a new benchmark dataset, **MindSET**, curated from Reddit using self-reported diagnoses. The dataset contains over **13M** annotated posts across seven mental health conditions, more than twice the size of previous benchmarks. To ensure data quality, we applied rigorous preprocessing steps, including language filtering and removal of Not Safe for Work (NSFW) and duplicate content. We further performed a linguistic analysis using LIWC to examine psychological term frequencies across the eight groups represented in the dataset. To demonstrate the dataset's utility, we conducted binary classification experiments for diagnosis detection using both fine-tuned language models and Bag-of-Words (BoW) features. Models trained on MindSET consistently outperformed those trained on previous benchmarks, achieving up to an **18-point** improvement in F1 for autism detection. Overall, MindSET provides a robust foundation for researchers exploring the intersection of social media and mental health, supporting both early risk detection and deeper analysis of emerging psychological trends.
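The deduplication step mentioned in the preprocessing pipeline can be sketched as exact-match hashing over normalized text. This is a minimal illustration using only the Python standard library; the paper's actual deduplication criteria (e.g., near-duplicate handling) are not specified here, so the normalization rules below are assumptions.

```python
import hashlib

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivially reformatted
    # reposts hash to the same digest (an assumed normalization).
    return " ".join(text.lower().split())

def deduplicate(posts: list[str]) -> list[str]:
    """Keep the first occurrence of each normalized post, drop exact repeats."""
    seen: set[str] = set()
    kept: list[str] = []
    for post in posts:
        digest = hashlib.sha256(normalize(post).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(post)
    return kept

# Toy stand-ins for Reddit posts; the real benchmark has millions.
posts = [
    "Feeling overwhelmed today.",
    "feeling   overwhelmed TODAY.",  # duplicate up to case/whitespace
    "Started therapy this week.",
]
unique_posts = deduplicate(posts)  # keeps 2 of the 3 posts
```

Hashing digests rather than storing full normalized strings keeps memory bounded when scanning millions of posts.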
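The BoW-based diagnosis-detection experiments could look roughly like the following scikit-learn sketch: Bag-of-Words features feeding a linear classifier for a binary diagnosis-vs-control task. The toy posts, labels, and model choice (logistic regression) are illustrative assumptions, not the paper's exact setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in posts; the real experiments train on millions of
# diagnosis-annotated Reddit posts.
posts = [
    "i was diagnosed with autism last year and it explained so much",
    "my sensory overload gets worse in crowded places",
    "routines help me cope with unexpected changes",
    "just finished a great hike this weekend",
    "trying a new pasta recipe tonight",
    "the football match last night was amazing",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = diagnosis group, 0 = control

# Bag-of-Words (unigrams + bigrams) into a linear classifier,
# one common baseline for binary diagnosis detection.
clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
clf.fit(posts, labels)

pred = clf.predict(["crowds give me sensory overload"])[0]
```

In practice one would evaluate with a held-out split and report F1, as the abstract does; this sketch only shows the feature-to-classifier wiring.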