Combining Artificial Users and Psychotherapist Assessment to Evaluate Large Language Model-based Mental Health Chatbots

📅 2025-03-27
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Evaluating LLM-based mental health chatbots involves a fundamental trade-off between clinical safety assurance and comprehensive test coverage. Method: The paper proposes a dual evaluation framework that combines dialogue generation with artificial users and dialogue assessment by psychotherapists. Artificial users are systematically generated from patient vignettes, parameterized by depression severity, personality traits, and attitudes toward chatbots, which enables controllable, multidimensional dialogue testing. Clinical validity is ensured by having psychotherapists rate a random sample of the generated dialogues with standardized clinical rating scales. Contribution/Results: Experiments demonstrate that the framework achieves broad test coverage even though the artificial users are only moderately authentic. The evaluation confirms the chatbot's foundational capability in delivering behavioral activation and maintaining safety, while precisely identifying critical improvement dimensions, particularly the appropriateness of activity planning, thereby bridging the gap between scalable automation and clinical rigor in conversational AI evaluation.
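The profile-generation step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the characteristic levels, the `profile_to_prompt` wording, and the sample size are all assumptions chosen for the example.

```python
import itertools
import random

# Hypothetical characteristic levels for artificial users, mirroring the
# dimensions named in the paper (depression severity, personality traits,
# attitudes toward chatbots). The exact levels are assumptions.
SEVERITY = ["mild", "moderate", "severe"]
PERSONALITY = ["introverted", "extraverted"]
ATTITUDE = ["skeptical", "neutral", "enthusiastic"]


def build_profiles():
    """Enumerate every combination of user characteristics."""
    return [
        {"severity": s, "personality": p, "attitude": a}
        for s, p, a in itertools.product(SEVERITY, PERSONALITY, ATTITUDE)
    ]


def profile_to_prompt(profile):
    """Render a profile as a system prompt for an LLM playing the patient."""
    return (
        f"You are a patient experiencing {profile['severity']} depression. "
        f"You are {profile['personality']} and {profile['attitude']} about "
        "chatbots. Respond naturally to the behavioral activation chatbot."
    )


profiles = build_profiles()
# Draw a random subset of profiles whose dialogues go to expert raters,
# analogous to the paper's random selection of dialogues for review.
rated_subset = random.sample(profiles, k=5)
```

With three severity levels, two personality traits, and three attitudes, the grid yields 18 distinct profiles; scaling any dimension grows coverage combinatorially, which is what makes artificial users attractive for systematic testing.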

📝 Abstract
Large Language Models (LLMs) promise to overcome limitations of rule-based mental health chatbots through more natural conversations. However, evaluating LLM-based mental health chatbots presents a significant challenge: their probabilistic nature requires comprehensive testing to ensure therapeutic quality, yet conducting such evaluations with people with depression would impose an additional burden on a vulnerable group and risk exposing them to potentially harmful content. Our paper presents an evaluation approach for LLM-based mental health chatbots that combines dialogue generation with artificial users and dialogue evaluation by psychotherapists. We developed artificial users based on patient vignettes, systematically varying characteristics such as depression severity, personality traits, and attitudes toward chatbots, and let them interact with an LLM-based behavioral activation chatbot. Ten psychotherapists evaluated 48 randomly selected dialogues using standardized rating scales to assess the quality of behavioral activation and its therapeutic capabilities. We found that while artificial users showed moderate authenticity, they enabled comprehensive testing across different user types. In addition, the chatbot demonstrated promising capabilities in delivering behavioral activation and maintaining safety. Furthermore, we identified deficits, such as ensuring the appropriateness of the activity plan, which reveal necessary improvements for the chatbot. Our framework provides an effective method for evaluating LLM-based mental health chatbots while protecting vulnerable people during the evaluation process. Future research should improve the authenticity of artificial users and develop LLM-augmented evaluation tools to make psychotherapist evaluation more efficient, and thus further advance the evaluation of LLM-based mental health chatbots.
Problem

Research questions and friction points this paper is trying to address.

Evaluating therapeutic quality of LLM-based mental health chatbots safely
Assessing chatbot effectiveness across diverse user characteristics
Identifying improvements for LLM-based behavioral activation delivery
Innovation

Methods, ideas, or system contributions that make the work stand out.

Artificial users simulate diverse patient characteristics
Psychotherapists evaluate dialogues with standardized scales
Combines dialogue generation and expert assessment
Florian Onur Kuhlmeier
Karlsruhe Institute of Technology (KIT), human-centered systems lab (h-lab)
Leon Hanschmann
Karlsruhe Institute of Technology (KIT), human-centered systems lab (h-lab)
Melina Rabe
University of Greifswald, Chair of Clinical Psychology and Psychotherapy
Stefan Luettke
University of Greifswald, Chair of Clinical Psychology and Psychotherapy
Eva-Lotta Brakemeier
University of Greifswald, Chair of Clinical Psychology and Psychotherapy
Alexander Maedche
Professor, human-centered systems lab (h-lab), Karlsruhe Institute of Technology
Information Systems · Intelligent Systems · Human-Computer Interaction