CARE-Bench: A Benchmark of Diverse Client Simulations Guided by Expert Principles for Evaluating LLMs in Psychological Counseling

📅 2025-11-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current LLM-based mental health counseling evaluation suffers from three key limitations: (1) client simulation lacks clinical validity, (2) overreliance on static question-answering formats, and (3) narrow, unidimensional metrics insufficient for capturing holistic performance in complex, real-world cases. To address these, we propose the first psychology-informed, multidimensional dynamic evaluation benchmark. It constructs clinically grounded client personas and dialogue simulation mechanisms based on authentic counseling transcripts and expert annotations, and integrates validated psychological scales for quantitative, multi-attribute assessment. Departing from static, single-turn evaluation paradigms, our framework enables fine-grained analysis of empathy, causal reasoning, and intervention capability across diverse interactive scenarios. Extensive experiments on multiple general-purpose and domain-specific LLMs systematically identify model weaknesses across client subtypes, providing empirical foundations and methodological guidance for advancing AI-assisted psychotherapy.
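To make the "dynamic, interactive" evaluation concrete, the following is a minimal sketch of a multi-turn client-counselor loop with multidimensional scoring. All names here (the persona fields, the client subtypes, the scoring dimensions, and the scoring stub) are illustrative assumptions, not CARE-Bench's actual implementation:

```python
# Hypothetical sketch of a dynamic multi-turn evaluation loop.
# Persona fields, subtype labels, and scoring logic are assumptions
# for illustration, not the benchmark's real components.
from dataclasses import dataclass, field

@dataclass
class ClientPersona:
    presenting_issue: str
    subtype: str                      # e.g. "resistant", "emotional"
    history: list = field(default_factory=list)

def simulated_client_reply(persona, counselor_msg):
    # Stand-in for an LLM client simulator guided by expert principles.
    persona.history.append(counselor_msg)
    return f"[{persona.subtype} client responding to: {counselor_msg!r}]"

def score_dimensions(transcript):
    # Stand-in for scale-based multidimensional scoring; a real system
    # would rate the transcript against validated psychological scales.
    return {"empathy": 0.0, "causal_reasoning": 0.0, "intervention": 0.0}

def run_session(persona, counselor_model, max_turns=3):
    # Alternate counselor and simulated-client turns, then score.
    transcript = []
    client_msg = f"I've been struggling with {persona.presenting_issue}."
    for _ in range(max_turns):
        counselor_msg = counselor_model(client_msg)
        client_msg = simulated_client_reply(persona, counselor_msg)
        transcript.append((counselor_msg, client_msg))
    return score_dimensions(transcript)
```

The key design point this sketch captures is that the client's next utterance depends on the counselor model's previous turn, which is what distinguishes dynamic evaluation from a static question-answering format.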

📝 Abstract
The mismatch between the growing demand for psychological counseling and the limited availability of services has motivated research into the application of Large Language Models (LLMs) in this domain. Consequently, there is a need for a robust and unified benchmark to assess the counseling competence of various LLMs. Existing works, however, are limited by unprofessional client simulation, static question-and-answer evaluation formats, and unidimensional metrics. These limitations hinder their effectiveness in assessing a model's comprehensive ability to handle diverse and complex clients. To address this gap, we introduce CARE-Bench, a dynamic and interactive automated benchmark. It is built upon diverse client profiles derived from real-world counseling cases and simulated according to expert guidelines. CARE-Bench provides a multidimensional performance evaluation grounded in established psychological scales. Using CARE-Bench, we evaluate several general-purpose LLMs and specialized counseling models, revealing their current limitations. In collaboration with psychologists, we conduct a detailed analysis of the reasons for LLMs' failures when interacting with clients of different types, which provides directions for developing more comprehensive, universal, and effective counseling models.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' counseling competence with professional client simulations
Addressing limitations of static evaluation formats in psychological counseling
Providing multidimensional performance assessment based on psychological scales
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diverse client profiles from real counseling cases
Dynamic interactive benchmark with expert-guided simulation
Multidimensional evaluation based on psychological scales