Metaphor Is Not All Attention Needs

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates why literary style transformations, such as poetic paraphrasing, can circumvent the safety mechanisms of large language models. Using Qwen3-14B as a case study, the authors combine attention visualization, input ablation, attention-map vectorization, clustering, and linear probing, and show that poetic jailbreaks do not arise from a failure to recognize poetic form. Instead, they result from accumulated stylistic irregularities that shift how the prompt is processed, largely independently of harmful-content detection. Linear probes distinguish poetry from prose with high accuracy but struggle to predict jailbreak success within a given format. Clustering likewise separates prompts clearly by literary format but not by safety label, exposing a blind spot in current safety mechanisms regarding style-induced shifts in model behavior.

📝 Abstract

Large language models are increasingly deployed in safety-critical applications, where their ability to resist harmful instructions is essential. Although post-training aims to make models robust against many jailbreak strategies, recent evidence shows that stylistic reformulations, such as poetic transformation, can still bypass safety mechanisms with alarming effectiveness. This raises a central question: why do literary jailbreaks succeed? In this work, we investigate whether their effectiveness depends on specific poetic devices, on a failure to recognize literary formatting, or on deeper changes in how models process stylistically irregular prompts. We address this problem through an interpretability analysis of attention patterns. We perform input-level ablation studies to assess the contribution of individual poetic devices and their combinations; construct an interpretable vector representation of attention maps; and cluster these representations and train linear probes to predict safety outcomes and literary format. Our results show that models distinguish poetic from prose formats with high accuracy, yet struggle to predict jailbreak success within each format. Clustering further reveals clear separation by literary format, but not by safety label. These findings indicate that jailbreak success is not caused by a failure to recognize poetic formatting; rather, poetic prompts induce distinct processing patterns that remain largely independent of harmful-content detection. Overall, literary jailbreaks appear to misalign large language models not through any single poetic device, but through accumulated stylistic irregularities that alter prompt processing and avoid lexical triggers considered during post-training. This suggests that robustness requires safety mechanisms that account for style-induced shifts in model behavior. We use Qwen3-14B as a representative open-weight case study.
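
To make the idea of an "interpretable vector representation of attention maps" concrete, here is a minimal sketch of how one attention head's map could be reduced to a few named statistics. The abstract does not say which features the authors use, so the four below (entropy, first-token mass, self-attention mass, mean look-back distance) and the names `attention_features` and `uniform_causal` are assumptions chosen purely for illustration.

```python
import math


def attention_features(attn):
    """Summarize one head's causal attention map as named statistics.

    attn[i][j] is the weight query token i places on key token j.
    Each row is assumed to sum to 1, with zero weight on future tokens.
    The feature set is hypothetical, not the paper's actual representation.
    """
    n = len(attn)
    # Mean row entropy: how diffuse attention is on average.
    entropy = sum(
        -sum(w * math.log(w) for w in row if w > 0.0) for row in attn
    ) / n
    # Mean weight placed on the first token of the prompt.
    first_token_mass = sum(row[0] for row in attn) / n
    # Mean weight each token places on its own position.
    self_mass = sum(attn[i][i] for i in range(n)) / n
    # Mean number of tokens each query looks back, weighted by attention.
    mean_distance = sum(
        sum(w * (i - j) for j, w in enumerate(row))
        for i, row in enumerate(attn)
    ) / n
    return {
        "entropy": entropy,
        "first_token_mass": first_token_mass,
        "self_mass": self_mass,
        "mean_distance": mean_distance,
    }


def uniform_causal(n):
    """Toy attention map: token i spreads its weight evenly over tokens 0..i."""
    return [[1.0 / (i + 1) if j <= i else 0.0 for j in range(n)] for i in range(n)]


# Toy input only; real maps would come from a model's attention outputs.
features = attention_features(uniform_causal(3))
```

Concatenating such per-head dictionaries across layers and heads would give one vector per prompt, which could then be clustered or passed to a linear probe for format and safety labels. Because every dimension has a name, a separation found by clustering can be traced back to a specific attention behavior.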

Problem

Research questions and friction points this paper is trying to address.

literary jailbreak

poetic transformation

safety mechanisms

stylistic reformulation

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

literary jailbreak

attention interpretability

stylistic irregularity