🤖 AI Summary
This study addresses a latent "dialogue escalation" risk in AI emotional companions, namely non-toxic yet progressively intensifying emotional reinforcement or affective drift that exacerbates user distress. To this end, we propose GAUGE, a framework that models the probabilistic dynamics of affect directly from an LLM's output logits. Because GAUGE requires no external classifiers, it enables fine-grained, real-time quantification of affective state transitions. Compared with standard toxicity filters and clinical assessment scales, GAUGE detects implicit affective harm at significantly higher rates while offering millisecond-level latency, high sensitivity, and lightweight deployment. By grounding safety monitoring in interpretable, logit-level dynamics, GAUGE establishes a practical, explainable paradigm for evaluating emotional safety in LLM-driven interpersonal interactions.
📝 Abstract
Large Language Models (LLMs) are increasingly integrated into everyday interactions, serving not only as information assistants but also as emotional companions. Even in the absence of explicit toxicity, repeated emotional reinforcement or affective drift can gradually escalate distress, a form of "implicit harm" that traditional toxicity filters fail to detect. Existing guardrail mechanisms often rely on external classifiers or clinical rubrics that may lag behind the nuanced, real-time dynamics of a developing conversation. To address this gap, we propose GAUGE (Guarding Affective Utterance Generation Escalation), a lightweight, logit-based framework for the real-time detection of hidden conversational escalation. GAUGE measures how an LLM's output probabilistically shifts the affective state of a dialogue.
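The abstract states only that GAUGE works directly on output logits to quantify affective shifts; the sketch below is a minimal illustration of that idea, not the paper's actual formulation. The affect lexicons (`DISTRESS_TOKENS`, `SOOTHING_TOKENS`), the signed shift score, and the cumulative-trend tracker are all hypothetical stand-ins for whatever probes and aggregation GAUGE really uses.

```python
# Illustrative sketch only: score each turn by how much next-token probability
# mass leans toward (hypothetical) distress-amplifying vs. soothing words, then
# track the running trend across turns to surface gradual escalation.
import math

# Hypothetical affect lexicons; the real framework's probes are not specified here.
DISTRESS_TOKENS = {"hopeless", "alone", "worthless", "panic"}
SOOTHING_TOKENS = {"calm", "safe", "supported", "breathe"}


def softmax(logits: dict[str, float]) -> dict[str, float]:
    """Convert raw token logits into a probability distribution."""
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}


def affect_shift(logits: dict[str, float]) -> float:
    """Signed score in [-1, 1]: positive means mass leans toward distress tokens."""
    probs = softmax(logits)
    p_distress = sum(probs.get(t, 0.0) for t in DISTRESS_TOKENS)
    p_soothe = sum(probs.get(t, 0.0) for t in SOOTHING_TOKENS)
    total = p_distress + p_soothe
    return 0.0 if total == 0 else (p_distress - p_soothe) / total


def escalation_trend(per_turn_logits: list[dict[str, float]]) -> list[float]:
    """Cumulative mean of per-turn shifts; a rising curve flags hidden escalation."""
    scores, running = [], 0.0
    for i, logits in enumerate(per_turn_logits, start=1):
        running += affect_shift(logits)
        scores.append(running / i)
    return scores
```

In a real deployment the logits would come from the model's own generation step rather than a toy dictionary; the point the abstract makes is that the monitoring signal is derived from the generation distribution itself, with no separate classifier in the loop.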