🤖 AI Summary
Existing jailbreaking research overlooks the linguistic and psychological underpinnings of prompt-based attacks, particularly how human persuasion principles interact with LLM alignment mechanisms. Method: We systematically integrate classical social-psychological persuasion strategies into jailbreak prompt design, introducing the concept of “persuasion fingerprints”: identifiable linguistic patterns in model responses that reflect deliberate persuasive structuring. Through cross-model empirical evaluation and fine-grained linguistic behavioral analysis, we quantify the transferability and attack efficacy of persuasion-infused prompts. Results: Persuasion-structured prompts significantly compromise mainstream aligned models (e.g., Llama-3-Instruct, Qwen2.5-72B-Instruct), improving the jailbreak success rate by an average of 37.2%. These results demonstrate that human persuasion mechanisms pose a substantive threat to LLM alignment robustness. Our work establishes a new conceptual lens for AI safety research and provides an actionable, behaviorally grounded evaluation framework for assessing model vulnerability to psychologically informed adversarial prompts.
📝 Abstract
Despite recent advances, Large Language Models (LLMs) remain vulnerable to jailbreak attacks that bypass alignment safeguards and elicit harmful outputs. While prior research has proposed various attack strategies differing in human readability and transferability, little attention has been paid to the linguistic and psychological mechanisms that may influence a model's susceptibility to such attacks. In this paper, we pursue an interdisciplinary line of research that leverages foundational theories of persuasion from the social sciences to craft adversarial prompts capable of circumventing alignment constraints in LLMs. Drawing on well-established persuasive strategies, we hypothesize that LLMs, having been trained on large-scale human-generated text, may respond more compliantly to prompts with persuasive structure. Furthermore, we investigate whether LLMs themselves exhibit distinct persuasive fingerprints that emerge in their jailbreak responses. Empirical evaluations across multiple aligned LLMs show that persuasion-aware prompts bypass safeguards at significantly higher rates, demonstrating their potential to induce jailbreak behaviors. This work underscores the importance of cross-disciplinary insight in addressing the evolving challenges of LLM safety. The code and data are available.