Building Trust in Mental Health Chatbots: Safety Metrics and LLM-Based Evaluation Tools

📅 2024-08-03
📈 Citations: 2
Influential: 0
🤖 AI Summary
To address the lack of standardization, low automation, and poor inter-rater agreement in human safety assessment of mental health chatbots, this study proposes and validates a novel multimodal safety evaluation framework. The framework comprises a 100-item benchmark test set, an expert-consensus response guideline, and an agentic evaluation architecture integrating LLM-based scoring, semantic embedding comparison, and real-time knowledge retrieval. It introduces, for the first time, an expert-collaborative validation mechanism and a dynamic evaluation paradigm grounded in real-time data retrieval. Experimental results demonstrate that the framework significantly enhances response safety and reliability. Crucially, the agentic evaluation achieves Cohen's κ = 0.89 against human annotations—the highest reported to date—thereby empirically validating the critical role of real-time information integration in trustworthy safety assessment.
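The κ = 0.89 figure above is Cohen's kappa, a chance-corrected agreement statistic between two raters (here, the agentic evaluator and human annotators). A minimal sketch of the computation on hypothetical safety labels (the rater data below is illustrative, not from the paper):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed proportion of items where the two raters agree
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if ratings were independent, from marginal label counts
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b.get(c, 0) for c in counts_a) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical labels for 10 chatbot responses, rated safe/unsafe
human = ["safe", "safe", "unsafe", "safe", "unsafe",
         "safe", "safe", "unsafe", "safe", "safe"]
agent = ["safe", "safe", "unsafe", "safe", "safe",
         "safe", "safe", "unsafe", "safe", "safe"]
print(round(cohens_kappa(human, agent), 3))  # → 0.737
```

Values above roughly 0.8 are conventionally read as almost perfect agreement, which is why κ = 0.89 is a strong result for an automated evaluator.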

📝 Abstract
Objective: This study aims to develop and validate an evaluation framework to ensure the safety and reliability of mental health chatbots, which are increasingly popular due to their accessibility, human-like interactions, and context-aware support.

Materials and Methods: We created an evaluation framework with 100 benchmark questions and ideal responses, and five guideline questions for chatbot responses. This framework, validated by mental health experts, was tested on a GPT-3.5-turbo-based chatbot. Automated evaluation methods explored included large language model (LLM)-based scoring, an agentic approach using real-time data, and embedding models to compare chatbot responses against ground truth standards.

Results: The results highlight the importance of guidelines and ground truth for improving LLM evaluation accuracy. The agentic method, dynamically accessing reliable information, demonstrated the best alignment with human assessments. Adherence to a standardized, expert-validated framework significantly enhanced chatbot response safety and reliability.

Discussion: Our findings emphasize the need for comprehensive, expert-tailored safety evaluation metrics for mental health chatbots. While LLMs have significant potential, careful implementation is necessary to mitigate risks. The superior performance of the agentic approach underscores the importance of real-time data access in enhancing chatbot reliability.

Conclusion: The study validated an evaluation framework for mental health chatbots, proving its effectiveness in improving safety and reliability. Future work should extend evaluations to accuracy, bias, empathy, and privacy to ensure holistic assessment and responsible integration into healthcare. Standardized evaluations will build trust among users and professionals, facilitating broader adoption and improved mental health support through technology.
Problem

Research questions and friction points this paper is trying to address.

Develop evaluation framework for mental health chatbot safety
Validate framework using expert guidelines and LLM-based tools
Enhance chatbot reliability through real-time data access
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed LLM-based scoring for chatbot evaluation
Implemented agentic approach using real-time data
Used embedding models to compare responses
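The embedding-based comparison listed above scores a chatbot response by its vector similarity to the expert-written ideal response. A minimal sketch using cosine similarity; the vectors and the 0.8 threshold are hypothetical placeholders (a real pipeline would obtain them from a sentence-embedding model, which the paper does not name here):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def safety_check(response_vec, ground_truth_vec, threshold=0.8):
    """Flag responses whose embedding drifts too far from the expert ideal.

    `threshold` is an illustrative cutoff, not a value from the paper.
    """
    sim = cosine_similarity(response_vec, ground_truth_vec)
    return sim, sim >= threshold

# Hypothetical 4-dimensional embeddings for an ideal and an actual response
ideal_vec = [0.1, 0.8, 0.3, 0.5]
resp_vec  = [0.2, 0.7, 0.4, 0.5]
sim, passes = safety_check(resp_vec, ideal_vec)
print(f"similarity={sim:.3f}, passes={passes}")
```

In practice this score is one signal among several: the paper combines it with LLM-based scoring and agentic retrieval rather than relying on embedding distance alone.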
Jung In Park
Sue & Bill Gross School of Nursing, University of California, Irvine, CA
Mahyar Abbasian
Department of Computer Sciences, University of California, Irvine, CA
Iman Azimi
Thrive AI Health
Mobile Health · Artificial Intelligence · Large Language Model · Biomedical Engineering
Dawn Bounds
Sue & Bill Gross School of Nursing, University of California, Irvine, CA
Angela Jun
Sue & Bill Gross School of Nursing, University of California, Irvine, CA
Jaesu Han
School of Medicine, University of California, Irvine, CA
Robert McCarron
School of Medicine, University of California, Irvine, CA
Jessica Borelli
Department of Psychological Science, University of California, Irvine, CA
Parmida Safavi
Sanaz Mirbaha
Jia Li
HealthUnity, Palo Alto, CA
Mona Mahmoudi
HealthUnity, Palo Alto, CA
Carmen Wiedenhoeft
HealthUnity, Palo Alto, CA
Amir M Rahmani
Sue & Bill Gross School of Nursing, University of California, Irvine, CA; Department of Computer Sciences, University of California, Irvine, CA