🤖 AI Summary
This work addresses the limitations of current AI-based mental health assessments, which rely on aggregate metrics and fail to capture fine-grained failure modes—particularly in high-risk psychological states such as suicidal ideation—and lack dynamic risk awareness across multi-turn conversations. To bridge this gap, we introduce MHDash, the first open-source evaluation platform tailored for mental health AI systems. MHDash integrates multidimensional human annotations (covering concern type, risk severity, and conversational intent), simulated multi-turn dialogues, and benchmark comparisons against baseline models. Our framework enables, for the first time, risk-sensitive, fine-grained auditing, revealing that while mainstream large language models exhibit comparable overall performance, they differ significantly in false-negative rates under high-risk conditions—a disparity that further widens during extended interactions—thereby exposing the inadequacy of conventional evaluation paradigms.
📝 Abstract
Large language models (LLMs) are increasingly applied in mental health support systems, where reliable recognition of high-risk states such as suicidal ideation and self-harm is safety-critical. However, existing evaluations primarily rely on aggregate performance metrics, which often obscure risk-specific failure modes and provide limited insight into model behavior in realistic, multi-turn interactions. We present MHDash, an open-source platform designed to support the development, evaluation, and auditing of AI systems for mental health applications. MHDash integrates data collection, structured annotation, multi-turn dialogue generation, and baseline evaluation into a unified pipeline. The platform supports annotations across multiple dimensions, including Concern Type, Risk Level, and Dialogue Intent, enabling fine-grained and risk-aware analysis. Our results reveal several key findings: (i) simple baselines and advanced LLM APIs exhibit comparable overall accuracy yet diverge significantly on high-risk cases; (ii) some LLMs maintain consistent ordinal severity ranking while failing absolute risk classification, whereas others achieve reasonable aggregate scores but suffer from high false negative rates on severe categories; and (iii) performance gaps are amplified in multi-turn dialogues, where risk signals emerge gradually. These observations demonstrate that conventional benchmarks are insufficient for safety-critical mental health settings. By releasing MHDash as an open platform, we aim to promote reproducible research, transparent evaluation, and safety-aligned development of AI systems for mental health support.