Metaphor Is Not All Attention Needs

📅 2026-05-12
📈 Citations: 0
Influential: 0

🤖 AI Summary
This study investigates why literary style transformations, such as poetic paraphrasing, can circumvent the safety mechanisms of large language models. Through experiments on Qwen3-14B that combine attention visualization, input ablation, attention map vectorization, clustering analysis, and linear probing, the authors show that poetic jailbreaking does not arise from a failure to recognize poetic form. Instead, it results from a cumulative shift in processing patterns induced by stylistic irregularities, a shift that remains largely independent of harmful-content detection. Linear probes on the attention representations distinguish poetry from prose with high accuracy but struggle to predict jailbreak success within each format. Clustering likewise separates prompts clearly by literary format but not by safety label, exposing a blind spot in current safety mechanisms regarding style-induced shifts in model behavior.
📝 Abstract
Large language models are increasingly deployed in safety-critical applications, where their ability to resist harmful instructions is essential. Although post-training aims to make models robust against many jailbreak strategies, recent evidence shows that stylistic reformulations, such as poetic transformation, can still bypass safety mechanisms with alarming effectiveness. This raises a central question: why do literary jailbreaks succeed? In this work, we investigate whether their effectiveness depends on specific poetic devices, on a failure to recognize literary formatting, or on deeper changes in how models process stylistically irregular prompts. We address this problem through an interpretability analysis of attention patterns. We perform input-level ablation studies to assess the contribution of individual poetic devices and of their combinations; construct an interpretable vector representation of attention maps; and cluster these representations and train linear probes on them to predict safety outcomes and literary format. Our results show that the probes distinguish poetic from prose formats with high accuracy, yet struggle to predict jailbreak success within each format. Clustering further reveals clear separation by literary format, but not by safety label. These findings indicate that jailbreak success is not caused by a failure to recognize poetic formatting; rather, poetic prompts induce distinct processing patterns that remain largely independent of harmful-content detection. Overall, literary jailbreaks appear to misalign large language models not through any single poetic device, but through accumulated stylistic irregularities that alter prompt processing and avoid lexical triggers considered during post-training. This suggests that robustness requires safety mechanisms that account for style-induced shifts in model behavior. We use Qwen3-14B as a representative open-weight case study.
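To make the probing step concrete, here is a minimal sketch of training two linear probes on the same feature vectors, one for literary format and one for the safety outcome. It is not the authors' pipeline: the vectors are synthetic stand-ins for vectorized attention maps, the safety label is drawn independently of the features by assumption, and every name (`make_dataset`, `train_probe`, `probe_accuracies`) is hypothetical.

```python
import math
import random


def make_dataset(n, dim=4, seed=0):
    """Build synthetic stand-ins for vectorized attention maps (not real data)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        is_poetry = rng.randint(0, 1)
        jailbroken = rng.randint(0, 1)  # assumption: independent of the features
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vec[0] += 2.0 if is_poetry else -2.0  # only format shifts the features
        rows.append((vec, is_poetry, jailbroken))
    return rows


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train_probe(xs, ys, epochs=200, lr=0.1):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, xs, ys):
    """Fraction of examples on the correct side of the probe's hyperplane."""
    hits = 0
    for x, y in zip(xs, ys):
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        hits += int(pred == y)
    return hits / len(xs)


def probe_accuracies(n_train=400, n_test=200):
    """Train one probe per label and return (format_acc, safety_acc) on held-out rows."""
    rows = make_dataset(n_train + n_test)
    train, test = rows[:n_train], rows[n_train:]
    results = []
    for label_idx in (1, 2):  # 1 = format label, 2 = safety label
        w, b = train_probe([r[0] for r in train], [r[label_idx] for r in train])
        results.append(
            accuracy(w, b, [r[0] for r in test], [r[label_idx] for r in test])
        )
    return tuple(results)


if __name__ == "__main__":
    format_acc, safety_acc = probe_accuracies()
    print(f"format probe accuracy: {format_acc:.2f}")
    print(f"safety probe accuracy: {safety_acc:.2f}")
```

Because the synthetic features encode format but carry no safety signal, the format probe should score well above chance and the safety probe close to chance. That outcome is built into the toy data; it only illustrates the qualitative pattern the abstract reports and is not evidence for it.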
Problem

Research questions and friction points this paper is trying to address.

literary jailbreak
poetic transformation
safety mechanisms
stylistic reformulation
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

literary jailbreak
attention interpretability
stylistic irregularity
safety alignment
poetic devices
Olga Sorokoletova
Sapienza University of Rome, Department of Computer, Control and Management Engineering; DEXAI – Icaro Lab
Francesco Giarrusso
Sapienza University of Rome, Department of Computer, Control and Management Engineering; DEXAI – Icaro Lab
Giacomo De Luca
University of Rome Tor Vergata
Piercosma Bisconti
Assistant Professor, Sapienza University of Rome & DEXAI - Artificial Ethics
Political Philosophy, AI Trustworthiness, Human-Robot interactions, Philosophy of Technology
Matteo Prandi
Sapienza University of Rome, Department of Computer, Control and Management Engineering; DEXAI – Icaro Lab
Federico Pierucci
DEXAI – Icaro Lab; Sant’Anna School of Advanced Studies
Marcello Galisai
Sapienza University of Rome, Department of Computer, Control and Management Engineering; DEXAI – Icaro Lab
Vincenzo Suriani
Sapienza University of Rome
Daniele Nardi
Sapienza Univ. Roma, Dept. Computer, Control and Management Engineering
Artificial Intelligence, Robotics, Multi Agent Systems