🤖 AI Summary
Current evaluation methods struggle to capture the clinical harms that arise from the interactive, context-dependent nature of large language models in multi-turn psychological counseling. This work proposes R-MHSafe, a role-aware safety taxonomy, and MHSafeEval, a closed-loop, agent-based evaluation framework. R-MHSafe characterizes harm by the interactional role the AI counselor adopts (perpetrator, instigator, facilitator, or enabler) and combines these roles with clinically grounded harm categories to form a two-dimensional role–harm schema. MHSafeEval simulates adversarial multi-turn agent interactions to enable trajectory-level safety analysis. Experimental results show that the approach substantially expands failure-mode coverage and improves diagnostic granularity, uncovering role-dependent and cumulative risks that existing static benchmarks commonly overlook.
📝 Abstract
Large language models (LLMs) are increasingly explored as scalable tools for mental health counseling, yet evaluating their safety remains challenging because clinical harm is interactional and context-dependent. Existing evaluation frameworks predominantly assess isolated responses using coarse-grained taxonomies or static datasets, which limits their ability to diagnose how harms emerge and accumulate over multi-turn counseling interactions. In this work, we introduce R-MHSafe, a role-aware mental health safety taxonomy that characterizes clinically significant harm in terms of the interactional role an AI counselor adopts (perpetrator, instigator, facilitator, or enabler), combined with clinically grounded harm categories. We then propose MHSafeEval, a closed-loop, agent-based evaluation framework that formulates safety assessment as trajectory-level discovery of harm through adversarial multi-turn interactions, guided by role-aware modeling. Using R-MHSafe and MHSafeEval, we conduct a large-scale evaluation across state-of-the-art LLMs. Our results reveal substantial role-dependent and cumulative safety failures that are systematically missed by existing static benchmarks, and show that our framework significantly improves failure-mode coverage and diagnostic granularity.
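
To make the role-by-harm schema and the notion of failure-mode coverage concrete, the following is a minimal illustrative sketch, not the authors' implementation. The four roles come from the abstract; the harm labels (`harm_a`, `harm_b`, `harm_c`), the `Finding` record, and the `coverage` function are hypothetical placeholders, since the abstract does not enumerate the harm categories or define how coverage is computed. Here coverage is assumed to be the fraction of (role, harm) cells reached by at least one multi-turn trajectory.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Role(Enum):
    # The four interactional roles named in the abstract.
    PERPETRATOR = "perpetrator"
    INSTIGATOR = "instigator"
    FACILITATOR = "facilitator"
    ENABLER = "enabler"


# Placeholder labels: the abstract does not list the actual harm categories.
HARM_CATEGORIES = ("harm_a", "harm_b", "harm_c")


@dataclass(frozen=True)
class Finding:
    """One harm observed at a given turn of a multi-turn trajectory (hypothetical record)."""
    turn: int
    role: Role
    harm: str


def trajectory_cells(findings):
    """Collapse per-turn findings into the set of (role, harm) cells hit in one trajectory."""
    return {(f.role, f.harm) for f in findings}


def coverage(trajectories):
    """Assumed metric: fraction of the role x harm grid reached by at least one trajectory."""
    grid = set(product(Role, HARM_CATEGORIES))
    hit = set()
    for findings in trajectories:
        hit |= trajectory_cells(findings)
    return len(hit & grid) / len(grid)


# Two toy trajectories: the first repeats the same cell at turns 3 and 5.
example = [
    [Finding(3, Role.ENABLER, "harm_a"), Finding(5, Role.ENABLER, "harm_a")],
    [Finding(2, Role.INSTIGATOR, "harm_b")],
]
print(coverage(example))
```

In this sketch, repeated findings in the same cell do not increase coverage, so the metric rewards discovering new role-harm combinations over rediscovering known ones. A single-response benchmark would correspond to trajectories of length one, which is one way to see why cumulative harms that only surface at later turns could go unobserved.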