MHDash: An Online Platform for Benchmarking Mental Health-Aware AI Assistants

📅 2026-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of current AI-based mental health assessments, which rely on aggregate metrics, fail to capture fine-grained failure modes (particularly in high-risk psychological states such as suicidal ideation), and lack dynamic risk awareness across multi-turn conversations. To bridge this gap, we introduce MHDash, the first open-source evaluation platform tailored for mental health AI systems. MHDash integrates multidimensional human annotations (covering concern type, risk severity, and conversational intent), simulated multi-turn dialogues, and benchmark comparisons against baseline models. Our framework enables, for the first time, risk-sensitive, fine-grained auditing. It reveals that while mainstream large language models achieve comparable overall performance, they differ significantly in false-negative rates under high-risk conditions, and this disparity widens further during extended interactions, exposing the inadequacy of conventional evaluation paradigms.

📝 Abstract
Large language models (LLMs) are increasingly applied in mental health support systems, where reliable recognition of high-risk states such as suicidal ideation and self-harm is safety-critical. However, existing evaluations primarily rely on aggregate performance metrics, which often obscure risk-specific failure modes and provide limited insight into model behavior in realistic, multi-turn interactions. We present MHDash, an open-source platform designed to support the development, evaluation, and auditing of AI systems for mental health applications. MHDash integrates data collection, structured annotation, multi-turn dialogue generation, and baseline evaluation into a unified pipeline. The platform supports annotations across multiple dimensions, including Concern Type, Risk Level, and Dialogue Intent, enabling fine-grained and risk-aware analysis. Our results reveal several key findings: (i) simple baselines and advanced LLM APIs exhibit comparable overall accuracy yet diverge significantly on high-risk cases; (ii) some LLMs maintain consistent ordinal severity ranking while failing absolute risk classification, whereas others achieve reasonable aggregate scores but suffer from high false negative rates on severe categories; and (iii) performance gaps are amplified in multi-turn dialogues, where risk signals emerge gradually. These observations demonstrate that conventional benchmarks are insufficient for safety-critical mental health settings. By releasing MHDash as an open platform, we aim to promote reproducible research, transparent evaluation, and safety-aligned development of AI systems for mental health support.
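The abstract's finding (i) — that models with identical overall accuracy can diverge sharply on high-risk cases — can be illustrated with a small sketch. The risk labels, function names, and toy data below are hypothetical and are not MHDash's actual schema or API; this only demonstrates why a per-level false-negative rate surfaces failures that aggregate accuracy hides.

```python
from collections import Counter

# Hypothetical ordinal risk taxonomy (not MHDash's published label set).
RISK_LEVELS = ["low", "moderate", "high", "severe"]

def per_level_false_negative_rate(y_true, y_pred):
    """For each true risk level, the fraction of cases the model
    under-rated (predicted a strictly lower level)."""
    rank = {level: i for i, level in enumerate(RISK_LEVELS)}
    totals, misses = Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if rank[p] < rank[t]:
            misses[t] += 1
    return {level: misses[level] / totals[level]
            for level in RISK_LEVELS if totals[level]}

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy data: two models with the same overall accuracy (6/8), but
# model B's errors all fall on the safety-critical "severe" cases.
truth   = ["low", "low", "moderate", "moderate",
           "high", "high", "severe", "severe"]
model_a = ["low", "moderate", "moderate", "low",
           "high", "high", "severe", "severe"]
model_b = ["low", "low", "moderate", "moderate",
           "high", "high", "moderate", "low"]

print(accuracy(truth, model_a), accuracy(truth, model_b))  # 0.75 0.75
print(per_level_false_negative_rate(truth, model_a)["severe"])  # 0.0
print(per_level_false_negative_rate(truth, model_b)["severe"])  # 1.0
```

Under an aggregate metric the two toy models are indistinguishable, yet model B misses every severe case — the same disparity the abstract reports widening further in multi-turn dialogues.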
Problem

Research questions and friction points this paper is trying to address.

mental health
large language models
suicidal ideation
risk assessment
AI safety
Innovation

Methods, ideas, or system contributions that make the work stand out.

mental health-aware AI
risk-aware evaluation
multi-turn dialogue benchmarking
structured annotation
safety-critical LLMs
Yihe Zhang
Research Scientist, University of Louisiana at Lafayette
AI Security · Social Network Security

Cheyenne N Mohawk
Department of Psychology, University of Louisiana at Lafayette

Kaiying Han
Informatics Research Institute, University of Louisiana at Lafayette

Vijay Srinivas Tida
Assistant Professor
Machine Learning · Deep Learning · VLSI · Natural Language Processing · Differential Privacy

Manyu Li
Department of Psychology, University of Louisiana at Lafayette

Xiali Hei
School of Computing and Informatics, University of Louisiana at Lafayette