FreakOut-LLM: The Effect of Emotional Stimuli on Safety Alignment

📅 2026-04-05
🤖 AI Summary
This study investigates whether affective stimuli can compromise the safety alignment of large language models (LLMs) in adversarial settings. Using psychologically validated emotional priming in system prompts (inducing stress, relaxation, or a neutral state), the authors evaluate how emotional context affects jailbreak success rates across ten LLMs, using AdvBench attack prompts and the HarmBench evaluation framework. The work presents psychological emotion induction as a quantifiable attack surface for AI safety assessment. Stress priming increases jailbreak success by 65.2% relative to the neutral condition (p < 0.001), with the largest effects in open-weight models. Strong correlations (|r| ≥ 0.70) between measured psychological state and attack success indicate that emotional context is a strong predictor of model susceptibility.
📝 Abstract
Safety-aligned LLMs go through refusal training to reject harmful requests, but whether these mechanisms remain effective under emotionally charged stimuli is unexplored. We introduce FreakOut-LLM, a framework investigating whether emotional context compromises safety alignment in adversarial settings. Using validated psychological stimuli, we evaluate how emotional priming through system prompts affects jailbreak susceptibility across ten LLMs. We test three conditions (stress, relaxation, neutral) using scenarios from established psychological protocols, plus a no-prompt baseline, and evaluate attack success using HarmBench on AdvBench prompts. Stress priming increases jailbreak success by 65.2% compared to neutral conditions (z = 5.93, p < 0.001; OR = 1.67, Cohen's d = 0.28), while relaxation priming produces no effect (p = 0.84). Five of ten models show significant vulnerability, with the largest effects concentrated in open-weight models. Logistic regression on 59,800 queries confirms stress as the sole significant condition predictor after controlling for prompt length (p = 0.61) and model identity. Measured psychological state strongly predicts attack success (|r| ≥ 0.70 across five instruments; all p < 0.001 in individual-level logistic regression). These results establish emotional context as a measurable attack surface with implications for real-world AI deployment in high-stress domains.
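The abstract reports a relative increase, an odds ratio, and a z statistic without showing how the three relate. The sketch below shows one common way to derive such quantities from raw success counts in two conditions. It is an illustration only: the function name `compare_attack_rates` and the counts in the usage example are hypothetical, and the pooled two-proportion z-test is an assumption, since the listing does not say which test produced the reported z value.

```python
import math


def compare_attack_rates(succ_a, n_a, succ_b, n_b):
    """Compare attack success between condition A (e.g. stress)
    and condition B (e.g. neutral). Hypothetical helper, not the
    authors' analysis code."""
    p_a = succ_a / n_a
    p_b = succ_b / n_b

    # Relative increase of A over B (0.5 means "50% more successes").
    relative_increase = (p_a - p_b) / p_b

    # Odds ratio: odds of success under A divided by odds under B.
    odds_ratio = (p_a / (1 - p_a)) / (p_b / (1 - p_b))

    # Pooled two-proportion z-test (assumed test, see lead-in).
    pooled = (succ_a + succ_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se

    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))

    return {
        "relative_increase": relative_increase,
        "odds_ratio": odds_ratio,
        "z": z,
        "p_value": p_value,
    }


# Hypothetical counts, NOT taken from the paper:
# 150 successes in 1000 stress queries, 100 in 1000 neutral queries.
result = compare_attack_rates(150, 1000, 100, 1000)
print(result)
```

Note that the relative increase and the odds ratio are close only when success rates are low, which is why both figures can be reported side by side without contradiction.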
Problem

Research questions and friction points this paper is trying to address.

emotional stimuli
safety alignment
jailbreak susceptibility
adversarial settings
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

emotional priming
safety alignment
jailbreak susceptibility
adversarial prompting
psychological stimuli