A Comprehensive Survey of Datasets for Clinical Mental Health AI Systems

📅 2025-08-13
🤖 AI Summary
Problem: Global mental health services face a widening gap between supply and demand, and the development of AI-assisted clinical tools is held back by a scarcity of high-quality, clinically validated datasets. Existing datasets are commonly fragmented, poorly documented, hard to access, culturally narrow, inconsistently annotated, and limited to a single modality, which undermines model reproducibility, generalizability, and fairness. Method: We conduct the first systematic, multidimensional evaluation of real-world and synthetic datasets used to train clinical mental health AI, assessing them across psychiatric disorder categories, data modalities, task objectives, and sociocultural contexts. Contribution/Results: Our analysis identifies critical gaps, including a severe scarcity of longitudinal data, inadequate cultural diversity, and substantial limitations in the fidelity and utility of synthetic data. We propose a standardized annotation framework, a cross-cultural collaborative data curation mechanism, and actionable pathways to improve dataset accessibility, establishing a methodological foundation and practical guidelines for developing reproducible, robust, and equitable clinical AI models.

📝 Abstract
Mental health disorders are rising worldwide. However, the availability of trained clinicians has not scaled proportionally, leaving many people without adequate or timely support. To bridge this gap, recent studies have shown the promise of Artificial Intelligence (AI) to assist mental health diagnosis, monitoring, and intervention. Yet the development of efficient, reliable, and ethical AI to assist clinicians depends heavily on high-quality clinical training datasets. Despite growing interest in data curation for training clinical AI assistants, existing datasets remain largely scattered, under-documented, and often inaccessible, hindering the reproducibility, comparability, and generalizability of AI models developed for clinical mental health care. In this paper, we present the first comprehensive survey of clinical mental health datasets relevant to the training and development of AI-powered clinical assistants. We categorize these datasets by mental disorder (e.g., depression, schizophrenia), data modality (e.g., text, speech, physiological signals), task type (e.g., diagnosis prediction, symptom severity estimation, intervention generation), accessibility (public, restricted, or private), and sociocultural context (e.g., language and cultural background). We also examine synthetic clinical mental health datasets. Our survey identifies critical gaps such as a lack of longitudinal data, limited cultural and linguistic representation, inconsistent collection and annotation standards, and limited modality coverage in synthetic data. We conclude by outlining key challenges in curating and standardizing future datasets and providing actionable recommendations to facilitate the development of more robust, generalizable, and equitable mental health AI systems.
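
The categorization axes named in the abstract can be pictured as a simple catalog schema. The sketch below is only an illustration under stated assumptions, not the survey's own tooling: the `DatasetRecord` fields, the `Access` enum, the `coverage` helper, and both catalog entries are hypothetical names and placeholder values, not datasets or results from the paper.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Access(Enum):
    """Accessibility levels listed in the abstract."""
    PUBLIC = "public"
    RESTRICTED = "restricted"
    PRIVATE = "private"


@dataclass(frozen=True)
class DatasetRecord:
    """One catalog entry; field names are hypothetical, chosen to mirror the survey's axes."""
    name: str
    disorders: tuple[str, ...]    # e.g. depression, schizophrenia
    modalities: tuple[str, ...]   # e.g. text, speech, physiological signals
    tasks: tuple[str, ...]        # e.g. diagnosis prediction
    access: Access
    languages: tuple[str, ...]    # proxy for sociocultural context
    longitudinal: bool = False
    synthetic: bool = False


def coverage(records, field):
    """Count how many records mention each value of a tuple-valued field."""
    counts = Counter()
    for record in records:
        counts.update(getattr(record, field))
    return counts


# Placeholder entries for illustration only; not real datasets.
catalog = [
    DatasetRecord("example_a", ("depression",), ("text", "speech"),
                  ("diagnosis prediction",), Access.PUBLIC, ("en",)),
    DatasetRecord("example_b", ("schizophrenia",), ("text",),
                  ("symptom severity estimation",), Access.RESTRICTED, ("en",),
                  synthetic=True),
]

modality_counts = coverage(catalog, "modalities")
language_counts = coverage(catalog, "languages")
longitudinal_share = sum(r.longitudinal for r in catalog) / len(catalog)

print(modality_counts)
print(language_counts)
print(longitudinal_share)
```

Tallying a catalog along each axis in this way is one possible route to surfacing the kinds of gaps the abstract describes, such as few longitudinal entries or a narrow set of languages.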

Problem

Research questions and friction points this paper is trying to address.

Lack of high-quality clinical datasets for mental health AI
Scattered and inaccessible datasets hinder AI model development
Insufficient cultural and linguistic diversity in existing datasets

Innovation

Methods, ideas, or system contributions that make the work stand out.

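First comprehensive survey of clinical mental health datasets for training AI-powered clinical assistants
Categorization of datasets by disorder, modality, task type, accessibility, and sociocultural context
Identification of gaps in longitudinal coverage, cultural and linguistic representation, and annotation standards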