🤖 AI Summary
This study addresses the challenge of reliably evaluating the therapeutic quality of large language models (LLMs) in mental health support, where responses must balance empathic engagement with cognitive reliability. The authors propose a novel human evaluation framework centered on therapeutic sensitivity, introducing what they describe as the first six-dimensional assessment scale to integrate both cognitive support and emotional resonance. A panel of psychiatric experts manually rated responses from nine open- and closed-source LLMs across 500 real-world psychological dialogue scenarios on these dimensions. The results reveal a pervasive “cognition–empathy gap”: closed-source models (e.g., GPT-4o) perform in a more balanced way, whereas open-source models exhibit weaker emotional expression and inconsistent affective alignment despite relatively reliable cognitive content. This work establishes a reproducible benchmark and a methodological foundation for evaluating and improving AI systems in mental health applications.
📝 Abstract
The escalating global mental health crisis, marked by persistent treatment gaps, limited availability of care, and a shortage of qualified therapists, positions Large Language Models (LLMs) as a promising avenue for scalable support. While LLMs offer potential for accessible emotional assistance, their reliability, therapeutic relevance, and alignment with human standards remain difficult to assess. This paper introduces a human-grounded evaluation methodology designed to assess LLM-generated responses in therapeutic dialogue. Our approach involved curating 500 mental health conversations drawn from datasets of real-world scenario questions and evaluating the responses generated by nine diverse LLMs, including both closed-source and open-source models. These responses were evaluated by two psychiatrically trained experts, who independently rated each one on a 5-point Likert scale across a comprehensive six-attribute rubric. The rubric captures both Cognitive Support and Affective Resonance, providing a multidimensional perspective on therapeutic quality. Our analysis reveals that LLMs provide strong cognitive reliability, producing safe, coherent, and clinically appropriate information, but demonstrate unstable affective alignment. Although closed-source models (e.g., GPT-4o) offer balanced therapeutic responses, open-source models show greater variability and emotional flatness. We identify a persistent cognitive–affective gap and highlight the need for failure-aware, clinically grounded evaluation frameworks that prioritize relational sensitivity alongside informational accuracy in mental-health-oriented LLMs. We advocate for balanced, human-in-the-loop evaluation protocols that center on therapeutic sensitivity, and we provide a framework to guide the responsible design and clinical oversight of mental-health-oriented conversational AI.
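The abstract does not specify how the two experts' Likert ratings were aggregated or how their agreement was checked. The sketch below illustrates one plausible analysis pipeline for this kind of protocol: per-attribute consensus means per model, plus quadratic-weighted Cohen's kappa as an inter-rater agreement measure. The attribute names, the nested-dictionary data layout, and the choice of weighted kappa are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch: two experts independently score each model response on
# a 1-5 Likert scale for each rubric attribute; we report per-attribute
# consensus means and quadratic-weighted Cohen's kappa for rater agreement.
# NOTE: attribute names, data layout, and the use of weighted kappa are
# illustrative assumptions, not details taken from the paper.
from statistics import mean
from sklearn.metrics import cohen_kappa_score

LIKERT_LABELS = [1, 2, 3, 4, 5]

def summarize(ratings):
    """ratings[model][attribute] -> (rater_a_scores, rater_b_scores),
    one integer Likert score per dialogue scenario."""
    for model, per_attr in ratings.items():
        for attr, (a, b) in per_attr.items():
            # Consensus score: mean of the two raters, averaged over scenarios.
            consensus = mean((x + y) / 2 for x, y in zip(a, b))
            # Quadratic weighting penalizes large ordinal disagreements more
            # than adjacent-category ones, a common choice for Likert data.
            kappa = cohen_kappa_score(a, b, weights="quadratic",
                                      labels=LIKERT_LABELS)
            print(f"{model:10s} {attr:22s} "
                  f"mean={consensus:.2f} kappa={kappa:.2f}")

# Toy example with two hypothetical attributes and four scenarios.
toy_ratings = {
    "gpt-4o": {
        "Emotional Validation": ([4, 5, 3, 4], [4, 4, 3, 5]),
        "Clinical Safety":      ([5, 5, 4, 5], [5, 4, 4, 5]),
    },
}
summarize(toy_ratings)
```

In a real analysis the dictionary would hold all 500 scenarios for each of the nine models and all six rubric attributes; the structure and the agreement statistic stay the same.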