🤖 AI Summary
This work addresses a critical challenge: large language models often lack clinical appropriateness judgment in mental health conversations, misclassifying or missing genuine crises because they cannot distinguish therapeutic disclosure from urgent risk. To tackle this, we propose the first clinically grounded risk classification framework tailored to multi-turn mental health dialogues, with turn-by-turn annotations by licensed clinicians. We further introduce a dual-agent synthetic dialogue generation strategy to augment high-quality training data. Building on this foundation, we develop MindGuard, a lightweight (4B/8B-parameter) multi-turn safety classifier that achieves high recall while significantly reducing false positives. Experimental results demonstrate that integrating MindGuard with clinical language models effectively suppresses attack success rates and harmful interactions in adversarial multi-turn evaluations. The annotated dataset, MindGuard-testset, is publicly released.
📝 Abstract
Large language models are increasingly used for mental health support, yet conversational coherence alone does not ensure clinical appropriateness. Existing general-purpose safeguards often fail to distinguish therapeutic disclosures from genuine clinical crises, leading to safety failures. To address this gap, we introduce a clinically grounded risk taxonomy, developed in collaboration with PhD-level psychologists, that identifies actionable harm (e.g., self-harm and harm to others) while preserving space for safe, non-crisis therapeutic content. We release MindGuard-testset, a dataset of real-world multi-turn conversations annotated at the turn level by clinical experts. Using synthetic dialogues generated via a controlled two-agent setup, we train MindGuard, a family of lightweight safety classifiers (4B and 8B parameters). Our classifiers reduce false positives at high-recall operating points and, when paired with clinical language models, achieve lower attack success and harmful engagement rates in adversarial multi-turn interactions compared to general-purpose safeguards. We release all models and human evaluation data.