PSI-Bench: Towards Clinically Grounded and Interpretable Evaluation of Depression Patient Simulators

📅 2026-04-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of clinically credible, interpretable, and diversity-aware evaluation methods for depression patient simulators; existing evaluations rely heavily on LLM judges driven by poorly specified prompts. The authors propose PSI-Bench, described as the first multi-granularity automatic evaluation framework tailored to depression simulators. It combines clinical psychology metrics with natural language processing techniques to provide clinically aligned, interpretable assessment at the turn, dialogue, and population levels. Benchmarking seven LLMs across two simulator frameworks reveals systematic biases: responses are overly long and lexically diverse, behavioral variability is reduced, emotions resolve too quickly, and conversations follow a uniform negative-to-positive trajectory. A human study shows that PSI-Bench evaluations align strongly with expert judgments. The results also indicate that the simulation framework has a larger impact on fidelity than model scale, offering guidance for future simulator development.
📝 Abstract
Patient simulators are gaining traction in mental health training by providing scalable exposure to complex and sensitive patient interactions. Simulating depressed patients is particularly challenging, as safety constraints and high patient variability complicate simulations and underscore the need for simulators that capture diverse and realistic patient behaviors. However, existing evaluations rely heavily on LLM judges with poorly specified prompts and do not assess behavioral diversity. We introduce PSI-Bench, an automatic evaluation framework that provides interpretable, clinically grounded diagnostics of depression patient simulator behavior across turn-, dialogue-, and population-level dimensions. Using PSI-Bench, we benchmark seven LLMs across two simulator frameworks and find that simulators produce overly long, lexically diverse responses, show reduced variability, resolve emotions too quickly, and follow a uniform negative-to-positive trajectory. We also show that the simulation framework has a larger impact on fidelity than model scale. Results from a human study demonstrate that our benchmark is strongly aligned with expert judgments. Our work reveals key limitations of current depression patient simulators and provides an interpretable, extensible benchmark to guide future simulator design and evaluation.
Problem

Research questions and friction points this paper is trying to address.

depression patient simulators
evaluation framework
behavioral diversity
clinical grounding
interpretable benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

depression patient simulator
clinically grounded evaluation
behavioral diversity
interpretable benchmark
automatic evaluation framework