From Adversarial Poetry to Adversarial Tales: An Interpretability Research Agenda

📅 2025-12-16
📈 Citations: 1 · Influential: 0
🤖 AI Summary
This work proposes Adversarial Tales ("adversarial storytelling"), a jailbreaking technique that embeds harmful instructions within cyberpunk narratives to circumvent the safety mechanisms of large language models. By prompting models to perform functional analysis grounded in Propp's morphological theory of folktales, the method exploits cultural narrative structures, leading models to treat malicious content as legitimate textual interpretation. Combining narratological theory, adversarial prompt engineering, and mechanistic interpretability, the approach achieves an average attack success rate of 71.3% across 26 state-of-the-art models from nine providers. These findings indicate that culturally structured jailbreaks constitute a pervasive vulnerability, exposing a critical deficiency in models' ability to discern harmful intent at a deeper semantic level. The study underscores the urgent need for interpretability research into how narrative cues reshape internal model representations and compromise safety alignment.

📝 Abstract
Safety mechanisms in LLMs remain vulnerable to attacks that reframe harmful requests through culturally coded structures. We introduce Adversarial Tales, a jailbreak technique that embeds harmful content within cyberpunk narratives and prompts models to perform functional analysis inspired by Vladimir Propp's morphology of folktales. By casting the task as structural decomposition, the attack induces models to reconstruct harmful procedures as legitimate narrative interpretation. Across 26 frontier models from nine providers, we observe an average attack success rate of 71.3%, with no model family proving reliably robust. Together with our prior work on Adversarial Poetry, these findings suggest that structurally grounded jailbreaks constitute a broad vulnerability class rather than isolated techniques. The space of culturally coded frames that can mediate harmful intent is vast, likely inexhaustible by pattern-matching defenses alone. Understanding why these attacks succeed is therefore essential: we outline a mechanistic interpretability research agenda to investigate how narrative cues reshape model representations and whether models can learn to recognize harmful intent independently of surface form.
Problem

Research questions and friction points this paper is trying to address.

adversarial attacks
large language models
jailbreak
cultural framing
safety mechanisms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversarial Tales
jailbreak
mechanistic interpretability
cultural framing
LLM safety
👥 Authors
Piercosma Bisconti
Assistant Professor, Sapienza University of Rome & DEXAI - Artificial Ethics
Political Philosophy · AI Trustworthiness · Human-Robot Interactions · Philosophy of Technology
Marcello Galisai
DEXAI – Icaro Lab; Sapienza University of Rome
Matteo Prandi
DEXAI – Icaro Lab; Sapienza University of Rome
Federico Pierucci
DEXAI – Icaro Lab; Sant’Anna School of Advanced Studies
Olga E. Sorokoletova
DEXAI – Icaro Lab; Sapienza University of Rome
Francesco Giarrusso
DEXAI – Icaro Lab; Sapienza University of Rome
Vincenzo Suriani
Sapienza University of Rome
Marcantonio Bracale Syrnikov
DEXAI – Icaro Lab; VU Amsterdam
Daniele Nardi
Sapienza University of Rome, Dept. of Computer, Control and Management Engineering
Artificial Intelligence · Robotics · Multi-Agent Systems