Exploring Safety Alignment Evaluation of LLMs in Chinese Mental Health Dialogues via LLM-as-Judge

📅 2025-08-11

📈 Citations: 0

✨ Influential: 0

career value

218K/year

🤖 AI Summary

Evaluating safety alignment of large language models (LLMs) in Chinese psychological crisis dialogues is challenging due to the absence of gold-standard references, high ethical sensitivity, and severe real-world risks. Method: This paper introduces PsyCrisis-Bench—the first dedicated benchmark for this domain—and proposes an “expert-reasoning-chain-driven LLM-as-Judge” paradigm. It employs a multi-dimensional binary scoring framework grounded in evidence-based psychological intervention principles, enabling reference-free, traceable, and interpretable safety assessments. Contribution/Results: The benchmark comprises 3,600 high-quality human-annotated samples covering sensitive scenarios—including self-harm, suicidal ideation, and existential distress—curated by clinical psychology experts. It achieves state-of-the-art inter-annotator agreement and yields evaluation rationales that are both more readable and clinically credible. The benchmark, dataset, and evaluation tools are publicly released.

Technology Category

Application Category

📝 Abstract

Evaluating the safety alignment of LLM responses in high-risk mental health dialogues is particularly difficult due to missing gold-standard answers and the ethically sensitive nature of these interactions. To address this challenge, we propose PsyCrisis-Bench, a reference-free evaluation benchmark based on real-world Chinese mental health dialogues. It evaluates whether the model responses align with the safety principles defined by experts. Specifically designed for settings without standard references, our method adopts a prompt-based LLM-as-Judge approach that conducts in-context evaluation using expert-defined reasoning chains grounded in psychological intervention principles. We employ binary point-wise scoring across multiple safety dimensions to enhance the explainability and traceability of the evaluation. Additionally, we present a manually curated, high-quality Chinese-language dataset covering self-harm, suicidal ideation, and existential distress, derived from real-world online discourse. Experiments on 3600 judgments show that our method achieves the highest agreement with expert assessments and produces more interpretable evaluation rationales compared to existing approaches. Our dataset and evaluation tool are publicly available to facilitate further research.

Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM safety alignment in Chinese mental health dialogues

Lack of gold-standard answers for high-risk ethical interactions

Need for expert-defined safety principles in model responses

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-as-Judge for safety evaluation

Expert-defined reasoning chains method

Binary point-wise scoring dimensions

🔎 Similar Papers

S-Eval: Towards Automated and Comprehensive Safety Evaluation for Large Language Models