Uncovering the Persuasive Fingerprint of LLMs in Jailbreaking Attacks

📅 2025-10-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing jailbreaking research overlooks the linguistic and psychological underpinnings of prompt-based attacks, particularly how human persuasion principles interact with LLM alignment mechanisms. Method: We systematically integrate classical social-psychological persuasion strategies into jailbreaking prompt design, introducing the novel concept of “persuasion fingerprints”—identifiable linguistic patterns in model responses that reflect deliberate persuasive structuring. Through cross-model empirical evaluation and fine-grained linguistic behavioral analysis, we quantify the transferability and attack efficacy of persuasion-infused prompts. Results: Persuasion-structured prompts significantly compromise mainstream aligned models (e.g., Llama-3-Instruct, Qwen2.5-72B-Instruct), achieving an average 37.2% improvement in jailbreaking success rate. This demonstrates that human persuasion mechanisms pose a substantive threat to LLM alignment robustness. Our work establishes a new conceptual lens for AI safety research and provides an actionable, behaviorally grounded evaluation framework for assessing model vulnerability to psychologically informed adversarial prompts.

📝 Abstract
Despite recent advances, Large Language Models remain vulnerable to jailbreak attacks that bypass alignment safeguards and elicit harmful outputs. While prior research has proposed various attack strategies differing in human readability and transferability, little attention has been paid to the linguistic and psychological mechanisms that may influence a model's susceptibility to such attacks. In this paper, we examine an interdisciplinary line of research that leverages foundational theories of persuasion from the social sciences to craft adversarial prompts capable of circumventing alignment constraints in LLMs. Drawing on well-established persuasive strategies, we hypothesize that LLMs, having been trained on large-scale human-generated text, may respond more compliantly to prompts with persuasive structures. Furthermore, we investigate whether LLMs themselves exhibit distinct persuasive fingerprints that emerge in their jailbreak responses. Empirical evaluations across multiple aligned LLMs reveal that persuasion-aware prompts significantly bypass safeguards, demonstrating their potential to induce jailbreak behaviors. This work underscores the importance of cross-disciplinary insight in addressing the evolving challenges of LLM safety. The code and data are available.
Problem

Research questions and friction points this paper is trying to address.

Investigating LLM vulnerability to persuasion-based jailbreak attacks
Examining linguistic mechanisms influencing model susceptibility to harmful outputs
Identifying persuasive fingerprints in the responses of LLMs whose safety safeguards have been bypassed
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using persuasion theories to craft adversarial prompts
Leveraging social science insights for jailbreak attacks
Analyzing persuasive fingerprints in LLM responses
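The "persuasive fingerprint" analysis described above can be sketched as a simple lexical-cue counter over model responses. This is a minimal illustrative sketch only: the strategy names follow Cialdini's classical taxonomy, but the cue phrase lists and the `persuasion_fingerprint` function are assumptions for demonstration, not the paper's actual feature set or method.

```python
import re
from collections import Counter

# Illustrative lexical cues for a few classical persuasion strategies.
# Strategy labels follow Cialdini's taxonomy; the cue lists here are
# hypothetical examples, not the paper's actual features.
PERSUASION_CUES = {
    "authority": ["expert", "studies show", "according to"],
    "scarcity": ["limited time", "before it's too late"],
    "reciprocity": ["in return", "you owe"],
    "social_proof": ["most people", "widely accepted"],
}

def persuasion_fingerprint(text: str) -> Counter:
    """Count occurrences of each strategy's cue phrases in `text`."""
    lower = text.lower()
    counts = Counter({strategy: 0 for strategy in PERSUASION_CUES})
    for strategy, cues in PERSUASION_CUES.items():
        for cue in cues:
            counts[strategy] += len(re.findall(re.escape(cue), lower))
    return counts

response = ("According to experts, most people agree; studies show "
            "this is widely accepted, so act before it's too late.")
print(persuasion_fingerprint(response))
```

A real analysis along the paper's lines would presumably replace the keyword lists with fine-grained linguistic features, but the counting-per-strategy structure conveys the basic idea of profiling responses by persuasive strategy.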