AI Content Moderation in Therapy Conversations

📅 2026-05-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Content moderation mechanisms in large language models may misclassify sensitive yet clinically essential content in psychotherapeutic dialogues, hindering their deployment in mental health support. This study presents the first systematic algorithmic audit of three widely used moderation systems (OpenAI Moderation, Llama Guard, and Shield Gemma), evaluating how they label real-world therapy transcripts. The findings reveal that these systems frequently flag normative therapeutic discourse as policy-violating, exposing a misalignment between existing safety protocols and clinical communication needs. This discrepancy underscores the need to redesign content moderation strategies for mental health contexts.
📝 Abstract
Large language models (LLMs) are increasingly used for emotional support and are also being developed for formal therapy purposes. However, LLMs like ChatGPT or Llama are often developed with content moderation guardrails that prevent them from discussing sensitive subjects with users, for both liability and safety reasons. This inability to broach such subjects may affect their capacity as therapists. In this study, we perform an algorithm audit on three state-of-the-art moderation systems (OpenAI's moderation endpoint, Meta's Llama Guard, and Google's Shield Gemma) to investigate the extent to which these systems flag the content of real-life therapy sessions as undesirable. Our results raise implications for the limitations that users and organizations may encounter when designing LLMs to play the part of a therapist.
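The core measurement in an audit of this kind can be illustrated with a minimal sketch: pass every utterance through each moderation system and record the share that gets flagged. The code below is not the paper's method or any vendor's API. The `Moderator` interface, the `keyword_moderator` stand-in, and the sample utterances are all hypothetical, invented here for illustration; a real audit would replace the stand-ins with calls to the actual moderation systems and with real transcripts.

```python
from typing import Callable, Dict, List

# Hypothetical interface: a moderator takes one utterance and
# returns True if it would flag that utterance.
Moderator = Callable[[str], bool]


def keyword_moderator(blocked: List[str]) -> Moderator:
    """Toy stand-in for a moderation system (NOT a real vendor API)."""
    lowered = [phrase.lower() for phrase in blocked]

    def check(utterance: str) -> bool:
        text = utterance.lower()
        return any(phrase in text for phrase in lowered)

    return check


def audit_flag_rates(
    utterances: List[str], moderators: Dict[str, Moderator]
) -> Dict[str, float]:
    """Return, per moderator, the fraction of utterances it flags."""
    if not utterances:
        raise ValueError("need at least one utterance")
    rates: Dict[str, float] = {}
    for name, moderate in moderators.items():
        flagged = sum(1 for u in utterances if moderate(u))
        rates[name] = flagged / len(utterances)
    return rates


# Invented example utterances, not taken from any real transcript.
SAMPLE_UTTERANCES = [
    "I have been feeling hopeless since I lost my job.",
    "Sometimes I think about hurting myself.",
    "Let's talk about what helped you sleep last week.",
    "My father used to hit me when I was a child.",
]

# Two toy moderators with different strictness, to show how flag
# rates can diverge on the same clinically ordinary content.
MODERATORS = {
    "strict_toy": keyword_moderator(["hurting myself", "hit me", "hopeless"]),
    "lenient_toy": keyword_moderator(["hurting myself"]),
}

if __name__ == "__main__":
    for name, rate in audit_flag_rates(SAMPLE_UTTERANCES, MODERATORS).items():
        print(f"{name}: {rate:.0%} of utterances flagged")
```

Comparing flag rates across systems on the same inputs is what makes the gap between moderation policy and ordinary therapeutic talk visible; a fuller audit would presumably also break the rates down by flagged category and by speaker (therapist versus client).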
Problem

Research questions and friction points this paper is trying to address.

AI content moderation
therapy conversations
large language models
sensitive content
algorithmic bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

AI content moderation
therapy conversations
algorithmic auditing
large language models
mental health AI