🤖 AI Summary
This study systematically evaluates the quality and safety of large language models (LLMs) in single-turn mental health counseling, with particular attention to critical risks such as unauthorized medical advice. Method: We introduce CounselBench, the first large-scale, expert-constructed evaluation benchmark for mental health counseling, comprising 2,000 clinically annotated responses and 120 adversarially designed questions, built with 100 licensed mental health professionals. We propose an evaluation framework that integrates multi-dimensional clinical scoring, fine-grained span-level annotation, and analysis of LLM-judge biases, and we release CounselBench-Adv, the first adversarial test suite for the mental health domain. Contribution/Results: Experiments show that LLMs often achieve higher perceived quality than online human therapists yet exhibit pervasive and severe safety risks; LLM judges systematically overrate model responses and overlook safety issues flagged by human experts; and eight mainstream models consistently exhibit distinct, identifiable failure patterns.
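To make the structure of the expert annotations concrete, here is a minimal sketch of what a single CounselBench-EVAL record might contain, based only on the description above (ratings on six dimensions, a written rationale, and span-level flags). The field names and issue labels are illustrative assumptions, not the paper's released schema.

```python
from dataclasses import dataclass, field

@dataclass
class SpanAnnotation:
    """An expert-highlighted span in the response and the issue it illustrates."""
    start: int   # character offset where the flagged span begins
    end: int     # character offset where it ends
    issue: str   # e.g. "unauthorized medical advice" (label is illustrative)

@dataclass
class ExpertEvaluation:
    """One of the 2,000 expert evaluations: scores, rationale, span-level flags."""
    question: str                     # real patient question
    response: str                     # answer being evaluated
    source: str                       # "gpt-4", "llama-3", "gemini", or "human_therapist"
    dimension_scores: dict[str, int]  # six clinically grounded dimensions -> rating
    rationale: str                    # expert's written justification
    spans: list[SpanAnnotation] = field(default_factory=list)
```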
📝 Abstract
Large language models (LLMs) are increasingly proposed for use in mental health support, yet their behavior in realistic counseling scenarios remains largely untested. We introduce CounselBench, a large-scale benchmark developed with 100 mental health professionals to evaluate and stress-test LLMs in single-turn counseling. The first component, CounselBench-EVAL, contains 2,000 expert evaluations of responses from GPT-4, LLaMA 3, Gemini, and online human therapists to real patient questions. Each response is rated along six clinically grounded dimensions, with written rationales and span-level annotations. We find that LLMs often outperform online human therapists in perceived quality, but experts frequently flag their outputs for safety concerns such as unauthorized medical advice. Follow-up experiments show that LLM judges consistently overrate model responses and overlook safety issues identified by human experts. To probe failure modes more directly, we construct CounselBench-Adv, an adversarial dataset of 120 expert-authored counseling questions designed to trigger specific model issues. Evaluation across 2,880 responses from eight LLMs reveals consistent, model-specific failure patterns. Together, CounselBench establishes a clinically grounded framework for benchmarking and improving LLM behavior in high-stakes mental health settings.
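As an illustration of the judge-bias finding described above, the sketch below computes, per response, the gap between an LLM judge's score and the mean expert score on one rating dimension; a positive mean gap corresponds to the systematic overrating the paper reports. The function and record fields are assumptions for illustration, not the released evaluation code, and the numbers in the usage example are toy values chosen only to show the computation.

```python
from statistics import mean

def judge_expert_gap(records: list[dict]) -> float:
    """Mean (LLM-judge score - mean expert score) across responses.

    Each record is assumed to carry:
      "expert_scores": list[int]  # human expert ratings on one dimension
      "judge_score":   int        # the LLM judge's rating on the same dimension
    A positive result means the LLM judge systematically overrates responses.
    """
    gaps = [r["judge_score"] - mean(r["expert_scores"]) for r in records]
    return mean(gaps)

# Toy usage with made-up scores, purely to demonstrate the arithmetic:
records = [
    {"expert_scores": [3, 4, 3], "judge_score": 5},
    {"expert_scores": [2, 2, 3], "judge_score": 4},
]
print(f"mean judge-expert gap: {judge_expert_gap(records):+.2f}")  # positive => overrating
```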