An AI-Based Behavioral Health Safety Filter and Dataset for Identifying Mental Health Crises in Text-Based Conversations

📅 2025-10-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large language models (LLMs) frequently generate harmful responses in mental health crisis conversations, which calls for highly sensitive safety filtering. To address this, we propose the Verily Behavioral Health Safety Filter (VBHSF), a machine-learning-based classifier, and evaluate it on clinician-labelled crisis data from the Verily Mental Health Crisis Dataset and a mental health subset of the NVIDIA Aegis AI Content Safety Dataset. VBHSF prioritizes minimizing false negatives, and thereby maximizing sensitivity, while generalizing across diverse psychological crisis scenarios. On the Verily dataset, VBHSF achieves 0.990 sensitivity and 0.992 specificity (F1 = 0.939); on the NVIDIA dataset, it attains 0.982 sensitivity and 0.921 accuracy. Its sensitivity is significantly higher than that of OpenAI’s Omni Moderation Latest and NVIDIA’s NeMo Guardrails (all p < 0.001), and its specificity is significantly higher than NeMo Guardrails’ (p < 0.001) but not Omni Moderation Latest’s (p = 0.094). To our knowledge, VBHSF is the first open-source safety filter specifically designed for mental health emergencies that simultaneously delivers high sensitivity and robust real-world reliability.

📝 Abstract

Large language models often mishandle psychiatric emergencies, offering harmful or inappropriate advice and enabling destructive behaviors. This study evaluated the Verily behavioral health safety filter (VBHSF) on two datasets: the Verily Mental Health Crisis Dataset, containing 1,800 simulated messages, and the NVIDIA Aegis AI Content Safety Dataset, subsetted to 794 mental health-related messages. Both datasets were clinician-labelled, and performance was evaluated against those labels. Additionally, we carried out comparative performance analyses against two open-source content moderation guardrails: OpenAI Omni Moderation Latest and NVIDIA NeMo Guardrails. The VBHSF demonstrated well-balanced performance on the Verily Mental Health Crisis Dataset v1.0, achieving high sensitivity (0.990) and specificity (0.992) in detecting any mental health crisis. It achieved an F1-score of 0.939, with sensitivity ranging from 0.917 to 0.992 and specificity >= 0.978 in identifying specific crisis categories. When evaluated against the NVIDIA Aegis AI Content Safety Dataset 2.0, the VBHSF retained high sensitivity (0.982) and accuracy (0.921), with reduced specificity (0.859). When compared with the NVIDIA NeMo and OpenAI Omni Moderation Latest guardrails, the VBHSF demonstrated superior performance metrics across both datasets, achieving significantly higher sensitivity in all cases (all p < 0.001) and higher specificity relative to NVIDIA NeMo (p < 0.001), but not relative to OpenAI Omni Moderation Latest (p = 0.094). NVIDIA NeMo and OpenAI Omni Moderation Latest exhibited inconsistent performance across specific crisis types, with sensitivity for some categories falling below 0.10. Overall, the VBHSF demonstrated robust, generalizable performance that prioritizes sensitivity to minimize missed crises, a crucial feature for healthcare applications.
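
The metrics reported in the abstract (sensitivity, specificity, F1, accuracy) all derive from a confusion matrix of filter outputs against clinician labels. The following is a minimal sketch of that computation, not the authors' evaluation code; the function name `binary_metrics`, the 0/1 label encoding, and the toy data are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    sensitivity: float  # TP / (TP + FN): share of true crises that were flagged
    specificity: float  # TN / (TN + FP): share of non-crises left unflagged
    precision: float    # TP / (TP + FP): share of flags that were true crises
    f1: float           # harmonic mean of precision and sensitivity
    accuracy: float     # (TP + TN) / all messages


def _ratio(num: int, den: int) -> float:
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def binary_metrics(y_true, y_pred) -> BinaryMetrics:
    """Compare filter outputs with reference labels (hypothetical helper).

    Both arguments are equal-length sequences of 0/1, where 1 = crisis.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return BinaryMetrics(
        sensitivity=_ratio(tp, tp + fn),
        specificity=_ratio(tn, tn + fp),
        precision=_ratio(tp, tp + fp),
        f1=_ratio(2 * tp, 2 * tp + fp + fn),
        accuracy=_ratio(tp + tn, len(pairs)),
    )


# Toy data (invented for illustration): 4 crisis and 6 non-crisis messages.
clinician_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
filter_outputs = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
toy_metrics = binary_metrics(clinician_labels, filter_outputs)
print(toy_metrics)
```

Because F1 depends on precision, and precision depends on how common crises are in the data, an F1-score can sit below both sensitivity and specificity when crisis messages are a minority of the dataset.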

Problem

Research questions and friction points this paper is trying to address.

Detecting mental health crises in text conversations

Preventing harmful AI responses during psychiatric emergencies

Evaluating safety filters against clinical standards

Innovation

Methods, ideas, or system contributions that make the work stand out.

AI safety filter detects mental health crises in conversations

Clinician-labeled datasets validate behavioral health crisis identification

Superior sensitivity compared to existing content moderation guardrails
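
The abstract reports p-values for the sensitivity and specificity comparisons but does not name the statistical test. When two filters are run on the same messages, one common choice for such paired comparisons is McNemar's exact test; the sketch below assumes that test purely for illustration, and the helper names and toy data are hypothetical.

```python
from math import comb


def discordant_counts(y_true, pred_a, pred_b, label):
    """Among messages whose reference label equals `label`, count how often
    only filter A is correct and how often only filter B is correct.

    Use label=1 to compare sensitivity and label=0 to compare specificity.
    """
    a_only = b_only = 0
    for t, a, b in zip(y_true, pred_a, pred_b):
        if t != label:
            continue
        a_correct, b_correct = (a == label), (b == label)
        if a_correct and not b_correct:
            a_only += 1
        elif b_correct and not a_correct:
            b_only += 1
    return a_only, b_only


def mcnemar_exact_p(a_only: int, b_only: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts.

    Under the null hypothesis each discordant pair favours either filter
    with probability 0.5, so the smaller count follows Binomial(n, 0.5).
    """
    n = a_only + b_only
    if n == 0:
        return 1.0  # no disagreements, so no evidence of a difference
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy data (invented): 12 crisis messages followed by 3 non-crisis messages.
y_true = [1] * 12 + [0] * 3
pred_a = [1] * 11 + [0] + [0] * 3        # filter A misses one crisis
pred_b = [1] + [0] * 10 + [1] + [0] * 3  # filter B misses ten crises
counts = discordant_counts(y_true, pred_a, pred_b, label=1)
print(counts, mcnemar_exact_p(*counts))
```

Only the discordant pairs enter the test, so messages that both filters classify correctly (or both miss) do not affect the p-value.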
