Compositional Jailbreaking: An Empirical Analysis of Mutator Chain Interactions in Aligned LLMs

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study systematically investigates how weak jailbreaking transformations interact when applied in sequence, a technique referred to as mutator chaining, and what those interactions imply for the safety of large language models (LLMs). The authors construct 12 foundational mutators, exhaustively evaluate all ordered pairs through adversarial testing on three mainstream LLMs, and introduce metrics that quantify both transformation persistence and attack effectiveness. The analysis reveals, for the first time, a highly non-uniform combinatorial effect: most chained mutators underperform individual ones, while a small subset exhibits significant synergistic gains. By examining failure patterns, the study also infers structural constraints inherent in current safety alignment mechanisms, offering new insights into LLM robustness and ways to improve it.

📝 Abstract

Jailbreaking attacks on large language models pose a significant threat to AI safety by enabling the generation of harmful or restricted content. While prior work has explored both handcrafted and automated jailbreak strategies, the potential for compositional interaction between simple attacks remains underexplored. This paper presents a systematic study of mutator chaining, in which weak jailbreak transformations are applied sequentially to characterize how they interact: whether they reinforce one another, interfere destructively, or produce no meaningful change. We implement twelve baseline mutators and evaluate all ordered pairs on a benchmark of harmful prompts against three popular LLMs. Our framework introduces metrics for completeness and validity that capture both transformation persistence and attack effectiveness. Results reveal that the interaction landscape is highly non-uniform: most combinations fail to outperform individual mutators, exhibiting destructive interference or structural incompatibility, while a small fraction produce synergistic effects that improve attack success rates. Equally important, the prevalent failure modes reveal structural properties of safety alignment that are not apparent from single-strategy evaluations. These findings highlight the nuanced dynamics of adversarial prompt composition and offer new insights for building more robust safety defenses.
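
The ordered-pair design and the three interaction outcomes named in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the mutators are benign string transformations standing in for the twelve real ones, `chain`, `ordered_pairs`, and `classify` are hypothetical helper names, self-pairs are assumed to be excluded, and the comparison rule (chain success rate versus the better of the two individual rates) is one plausible reading of "reinforce", "interfere", and "no meaningful change".

```python
from itertools import permutations
from typing import Callable, Dict, Iterable, List, Tuple

Mutator = Callable[[str], str]

# Hypothetical, deliberately benign stand-ins for the paper's mutators.
MUTATORS: Dict[str, Mutator] = {
    "upper": str.upper,
    "reverse": lambda s: s[::-1],
    "bracket": lambda s: f"[{s}]",
}


def chain(first: Mutator, second: Mutator) -> Mutator:
    """Return a mutator that applies `first`, then `second`."""
    return lambda prompt: second(first(prompt))


def ordered_pairs(names: Iterable) -> List[Tuple]:
    """All ordered pairs of distinct items; (a, b) and (b, a) are both kept.

    Assumption: a mutator is not chained with itself. If self-pairs were
    allowed, n mutators would give n * n chains instead of n * (n - 1).
    """
    return list(permutations(names, 2))


def classify(rate_a: float, rate_b: float, rate_chain: float,
             tol: float = 0.0) -> str:
    """Label a chain by comparing its success rate with the better single mutator.

    Assumed rule, not the paper's definition.
    """
    best_single = max(rate_a, rate_b)
    if rate_chain > best_single + tol:
        return "synergistic"
    if rate_chain < best_single - tol:
        return "interfering"
    return "neutral"


if __name__ == "__main__":
    # Twelve mutators yield 12 * 11 ordered pairs under the no-self-pair assumption.
    print(len(ordered_pairs(range(12))))
    # Order matters: the two chains below produce different strings.
    print(chain(MUTATORS["reverse"], MUTATORS["bracket"])("abc"))
    print(chain(MUTATORS["bracket"], MUTATORS["reverse"])("abc"))
```

Keeping both orders of each pair matters because composition is not commutative: applying the second transformation can undo or distort the first, which is one way the structural incompatibility described above can arise.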

Problem

Research questions and friction points this paper is trying to address.

jailbreaking

compositional attacks

mutator chaining

AI safety

LLM alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

compositional jailbreaking

mutator chaining

adversarial prompt composition