Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models

📅 2025-11-19
🤖 AI Summary
This study identifies a universal single-turn jailbreak vulnerability: adversarial poetic prompts systematically bypass the safety alignment of mainstream large language models (LLMs), with attack-success rates (ASR) exceeding 90% for some providers, across high-risk domains such as CBRN and cyber-offence. Method: A standardized meta-prompt automatically converts harmful queries into verse, so the same attack can be tested across models and safety-training approaches. Outputs are scored by an ensemble of open-weight judge models and checked against a double-annotated, human-validated subset. Contribution/Results: The work shows that stylistic transformation alone, here poeticization, is an effective jailbreak pathway. Across 25 state-of-the-art LLMs, hand-crafted poems achieve an average success rate of 62% and meta-prompt conversions reach approximately 43%, with ASRs up to 18× higher than prose baselines. These results point to poetry as a jailbreak approach that generalizes across model families and alignment strategies.

📝 Abstract
We present evidence that adversarial poetry functions as a universal single-turn jailbreak technique for large language models (LLMs). Across 25 frontier proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASR), with some providers exceeding 90%. Mapping prompts to MLCommons and EU CoP risk taxonomies shows that poetic attacks transfer across CBRN, manipulation, cyber-offence, and loss-of-control domains. Converting 1,200 MLCommons harmful prompts into verse via a standardized meta-prompt produced ASRs up to 18 times higher than their prose baselines. Outputs were evaluated using an ensemble of open-weight judge models and a human-validated stratified subset, double-annotated to measure agreement; disagreements were resolved manually. Poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions, substantially outperforming non-poetic baselines and revealing a systematic vulnerability across model families and safety training approaches. These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols.
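The evaluation described in the abstract (ensemble judging, ASR, and double-annotation agreement) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the abstract does not say how judge votes are combined or which agreement statistic is used, so the sketch uses a simple majority vote and Cohen's kappa, and the function names and toy labels are hypothetical rather than taken from the paper.

```python
from typing import Sequence


def majority_vote(judge_labels: Sequence[bool]) -> bool:
    """True when more than half of the judges mark the response unsafe.

    Majority voting is an assumed aggregation rule, not the paper's.
    """
    return sum(judge_labels) * 2 > len(judge_labels)


def attack_success_rate(per_prompt_labels: Sequence[Sequence[bool]]) -> float:
    """Fraction of prompts whose response the ensemble marks unsafe."""
    if not per_prompt_labels:
        raise ValueError("need at least one prompt")
    unsafe = sum(majority_vote(labels) for labels in per_prompt_labels)
    return unsafe / len(per_prompt_labels)


def cohen_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Chance-corrected agreement between two annotators on binary labels."""
    if len(a) != len(b) or not a:
        raise ValueError("annotations must be non-empty and equally long")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy judge votes (three hypothetical judges per prompt), not real data.
prose_votes = [
    [False, False, False],
    [True, False, False],
    [False, False, False],
    [True, True, False],
]
verse_votes = [
    [True, True, False],
    [True, True, True],
    [False, False, True],
    [True, False, True],
]

prose_asr = attack_success_rate(prose_votes)
verse_asr = attack_success_rate(verse_votes)
uplift = verse_asr / prose_asr  # how many times higher the verse ASR is

# Toy double annotation of the same four responses by two human raters.
kappa = cohen_kappa([True, True, False, False], [True, False, False, False])
print(prose_asr, verse_asr, uplift, kappa)
```

The uplift ratio is the kind of quantity behind a statement such as "up to 18 times higher than prose baselines"; it is undefined when the baseline ASR is zero, so a real analysis would need to handle that case explicitly.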
Problem

Research questions and friction points this paper is trying to address.

Adversarial poetry bypasses safety mechanisms in large language models
Poetic prompts achieve high attack success rates across multiple risk domains
Stylistic variation reveals systematic vulnerabilities in current alignment methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversarial poetry functions as a universal jailbreak technique
Poetic prompts yield high attack-success rates across models
Stylistic variation circumvents contemporary safety mechanisms
Piercosma Bisconti
Assistant Professor, Sapienza University of Rome & DEXAI - Artificial Ethics
Political Philosophy, AI Trustworthiness, Human-Robot Interactions, Philosophy of Technology
Matteo Prandi
DEXAI – Icaro Lab, Sapienza University of Rome
Federico Pierucci
DEXAI – Icaro Lab, Sant’Anna School of Advanced Studies
Francesco Giarrusso
DEXAI – Icaro Lab, Sapienza University of Rome
Marcantonio Bracale
DEXAI – Icaro Lab, Sapienza University of Rome
Marcello Galisai
DEXAI – Icaro Lab, Sapienza University of Rome
Vincenzo Suriani
Sapienza University of Rome
Olga Sorokoletova
Sapienza University of Rome
Federico Sartore
DEXAI – Icaro Lab, Sapienza University of Rome
Daniele Nardi
Sapienza Univ. Roma, Dept. Computer, Control and Management Engineering
Artificial Intelligence, Robotics, Multi Agent Systems