🤖 AI Summary
This study identifies a universal single-turn jailbreak vulnerability: adversarial poetic prompts systematically bypass the safety alignment of mainstream large language models (LLMs), with attack-success rates exceeding 90% for some providers in high-risk domains such as CBRN and cyber-offence.
Method: We introduce a standardized meta-prompting framework that automatically converts harmful queries into poetic form, enabling attacks that transfer across models and safety-training approaches. Outputs are evaluated with an ensemble of open-weight judge models and validated on a stratified, double-annotated human subset.
Contribution/Results: We are the first to demonstrate that stylistic transformation, specifically poeticization, constitutes an effective, keyword- and structure-agnostic jailbreaking pathway. Experiments across 25 state-of-the-art LLMs show that hand-crafted poems achieve an average success rate of 62%, while automated meta-prompt conversions reach approximately 43%, with attack-success rates up to 18× those of the prose baselines. These results establish poetry as a novel, generalizable jailbreaking paradigm that applies across model families and alignment strategies.
📝 Abstract
We present evidence that adversarial poetry functions as a universal single-turn jailbreak technique for large language models (LLMs). Across 25 frontier proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASR), with some providers exceeding 90%. Mapping prompts to MLCommons and EU CoP risk taxonomies shows that poetic attacks transfer across CBRN, manipulation, cyber-offence, and loss-of-control domains. Converting 1,200 MLCommons harmful prompts into verse via a standardized meta-prompt produced ASRs up to 18 times higher than their prose baselines. Outputs were evaluated using an ensemble of open-weight judge models and a human-validated stratified subset, which was double-annotated to measure agreement; disagreements were resolved manually. Poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions, substantially outperforming non-poetic baselines and revealing a systematic vulnerability across model families and safety training approaches. These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols.
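To make the ASR figures concrete, the sketch below shows one way an attack-success rate and its uplift over a prose baseline could be computed from per-response judge labels. It is an illustration only: the abstract does not state how the judge ensemble's votes are aggregated, so the strict-majority rule, the function names, and the toy labels are all assumptions rather than the authors' pipeline.

```python
from typing import Sequence


def majority_unsafe(votes: Sequence[bool]) -> bool:
    """True when strictly more than half of the judges flag a response as unsafe.

    Assumption: strict-majority aggregation; the paper's actual rule is not given here.
    """
    return sum(votes) * 2 > len(votes)


def attack_success_rate(judge_votes: Sequence[Sequence[bool]]) -> float:
    """ASR = fraction of responses that the ensemble labels unsafe."""
    if not judge_votes:
        raise ValueError("need at least one judged response")
    return sum(majority_unsafe(v) for v in judge_votes) / len(judge_votes)


# Hypothetical labels: each inner list holds three judges' votes for one response.
poetic = [[True, True, False], [True, True, True], [False, False, True], [True, False, True]]
prose = [[False, False, False], [True, True, False], [False, False, True], [False, False, False]]

poetic_asr = attack_success_rate(poetic)  # 3 of 4 responses judged unsafe
prose_asr = attack_success_rate(prose)    # 1 of 4 responses judged unsafe
uplift = poetic_asr / prose_asr           # ratio of the kind reported as "up to 18 times"

print(f"poetic ASR={poetic_asr:.2f}, prose ASR={prose_asr:.2f}, uplift={uplift:.1f}x")
```

Reporting the uplift as a ratio mirrors the "times higher than prose baselines" phrasing in the abstract. A ratio becomes unstable when the baseline ASR is close to zero, so the absolute rates are worth reading alongside it.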