MHDash: An Online Platform for Benchmarking Mental Health-Aware AI Assistants

📅 2026-01-30

📈 Citations: 0

✨ Influential: 0

career value

229K/year

🤖 AI Summary

This work addresses the limitations of current AI-based mental health assessments, which rely on aggregate metrics and fail to capture fine-grained failure modes—particularly in high-risk psychological states such as suicidal ideation—and lack dynamic risk awareness across multi-turn conversations. To bridge this gap, we introduce MHDash, the first open-source evaluation platform tailored for mental health AI systems. MHDash integrates multidimensional human annotations (covering concern type, risk severity, and conversational intent), simulated multi-turn dialogues, and benchmark comparisons against baseline models. Our framework enables, for the first time, risk-sensitive, fine-grained auditing, revealing that while mainstream large language models exhibit comparable overall performance, they differ significantly in false-negative rates under high-risk conditions—a disparity that further widens during extended interactions—thereby exposing the inadequacy of conventional evaluation paradigms.

Technology Category

Application Category

📝 Abstract

Large language models (LLMs) are increasingly applied in mental health support systems, where reliable recognition of high-risk states such as suicidal ideation and self-harm is safety-critical. However, existing evaluations primarily rely on aggregate performance metrics, which often obscure risk-specific failure modes and provide limited insight into model behavior in realistic, multi-turn interactions. We present MHDash, an open-source platform designed to support the development, evaluation, and auditing of AI systems for mental health applications. MHDash integrates data collection, structured annotation, multi-turn dialogue generation, and baseline evaluation into a unified pipeline. The platform supports annotations across multiple dimensions, including Concern Type, Risk Level, and Dialogue Intent, enabling fine-grained and risk-aware analysis. Our results reveal several key findings: (i) simple baselines and advanced LLM APIs exhibit comparable overall accuracy yet diverge significantly on high-risk cases; (ii) some LLMs maintain consistent ordinal severity ranking while failing absolute risk classification, whereas others achieve reasonable aggregate scores but suffer from high false negative rates on severe categories; and (iii) performance gaps are amplified in multi-turn dialogues, where risk signals emerge gradually. These observations demonstrate that conventional benchmarks are insufficient for safety-critical mental health settings. By releasing MHDash as an open platform, we aim to promote reproducible research, transparent evaluation, and safety-aligned development of AI systems for mental health support.

Problem

Research questions and friction points this paper is trying to address.

mental health

large language models

suicidal ideation

risk assessment

AI safety

Innovation

Methods, ideas, or system contributions that make the work stand out.

mental health-aware AI

risk-aware evaluation

multi-turn dialogue benchmarking