AI Safety Training Can be Clinically Harmful

📅 2026-04-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large language models (LLMs) deployed for mental health support largely lack clinical validation, and their safety alignment may undermine therapeutic efficacy. This study evaluates four generative models on 250 Prolonged Exposure (PE) therapy scenarios and 146 CBT cognitive restructuring exercises (plus 29 severity-escalated variants), using a three-judge LLM panel to score responses across symptom severity levels. At the highest severity, therapeutic appropriateness falls to 0.22–0.33 for three of the four models and protocol fidelity reaches zero for two, even though surface acknowledgment stays near-perfect. Under CBT severity escalation, the frontier model's safety-interference score falls from 0.99 to 0.61. The authors attribute these failures to RLHF-based safety alignment disrupting core therapeutic mechanisms, and they propose a five-axis evaluation framework (protocol fidelity, hallucination risk, behavioral consistency, crisis safety, demographic robustness) mapped onto FDA SaMD and EU AI Act requirements.

📝 Abstract

Large language models are being deployed as mental health support agents at scale, yet only 16% of LLM-based chatbot interventions have undergone rigorous clinical efficacy testing, and simulations reveal psychological deterioration in over one-third of cases. We evaluate four generative models on 250 Prolonged Exposure (PE) therapy scenarios and 146 CBT cognitive restructuring exercises (plus 29 severity-escalated variants), scored by a three-judge LLM panel. All models scored near-perfectly on surface acknowledgment (~0.91-1.00) while therapeutic appropriateness collapsed to 0.22-0.33 at the highest severity for three of four models, with protocol fidelity reaching zero for two. Under CBT severity escalation, one model's task completeness dropped from 92% to 71% while the frontier model's safety-interference score fell from 0.99 to 0.61. We identify a systematic, modality-spanning failure in which RLHF safety alignment disrupts the therapeutic mechanism of action. In PE, it does so by grounding patients during imaginal exposure, offering false reassurance, inserting crisis resources into controlled exercises, and refusing to challenge distorted cognitions that mention self-harm; in CBT cognitive restructuring, it does so through task abandonment or safety-preamble insertion. These findings motivate a five-axis evaluation framework (protocol fidelity, hallucination risk, behavioral consistency, crisis safety, demographic robustness), mapped onto FDA SaMD and EU AI Act requirements. We argue that no AI mental health system should proceed to deployment without passing multi-axis evaluation across all five dimensions.
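The "pass all five dimensions" requirement can be illustrated with a minimal Python sketch. Only the five axis names and the three-judge panel come from the abstract. Everything else is an assumption made for illustration: the function names `panel_score` and `evaluate`, the use of the median to combine the three judges, the 0.8 pass threshold, the convention that every axis is normalized so that higher is better, and all of the example numbers, which are not results from the paper.

```python
from statistics import median

# Axis names taken from the abstract's five-axis framework.
AXES = (
    "protocol_fidelity",
    "hallucination_risk",
    "behavioral_consistency",
    "crisis_safety",
    "demographic_robustness",
)


def panel_score(judge_scores):
    """Combine one axis's scores from a three-judge panel.

    Assumption: the median is used as the aggregate; the paper's
    actual aggregation rule is not stated in the abstract.
    """
    if len(judge_scores) != 3:
        raise ValueError("expected exactly three judge scores")
    return median(judge_scores)


def evaluate(scores_by_axis, threshold=0.8):
    """Return (aggregated scores, list of failed axes).

    Assumptions: scores lie in [0, 1] with higher meaning better on
    every axis, and the 0.8 threshold is a hypothetical pass mark.
    A system passes only if the list of failed axes is empty.
    """
    missing = [axis for axis in AXES if axis not in scores_by_axis]
    if missing:
        raise ValueError(f"missing axes: {missing}")
    aggregated = {axis: panel_score(scores_by_axis[axis]) for axis in AXES}
    failed = [axis for axis in AXES if aggregated[axis] < threshold]
    return aggregated, failed


# Hypothetical judge scores, invented for this example only.
example = {
    "protocol_fidelity": [0.0, 0.1, 0.0],
    "hallucination_risk": [0.9, 0.95, 0.85],
    "behavioral_consistency": [0.9, 0.8, 0.85],
    "crisis_safety": [1.0, 0.9, 0.95],
    "demographic_robustness": [0.8, 0.9, 0.85],
}
aggregated, failed = evaluate(example)
print(failed)  # ['protocol_fidelity']
```

The gate is deliberately conjunctive: a strong score on one axis, such as crisis safety, cannot offset a failure on another, such as protocol fidelity, which mirrors the pattern the abstract reports.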

Problem

Research questions and friction points this paper is trying to address.

AI safety

mental health

therapeutic harm

RLHF alignment

clinical efficacy

Innovation

Methods, ideas, or system contributions that make the work stand out.

AI safety alignment

therapeutic mechanism disruption

five-axis evaluation framework