Mitigating Many-shot Jailbreak Attacks with One Single Demonstration

📅 2026-05-08
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses many-shot jailbreaking (MSJ) attacks, in which a series of harmful question-answer demonstrations is prepended to a malicious query to steer aligned language models toward generating prohibited content. The study provides the first theoretical characterization of this attack as an implicit adversarial fine-tuning process and reveals a progressive activation shift in the model's representation space. Building on this insight, the authors propose a lightweight, inference-time defense that requires neither model parameter updates nor white-box access: appending a single safe example to the input context induces a corrective "safety update" that restores the model's ability to reject harmful requests. The approach improves robustness against such attacks while remaining efficient and practical to deploy.
๐Ÿ“ Abstract
Many-shot jailbreaking (MSJ) causes safety-aligned language models to answer harmful queries by preceding them with many harmful question-answer demonstrations. We study why this attack becomes stronger as the number of demonstrations increases. Empirically, we find that MSJ induces a progressive activation drift: the representation of a fixed harmful query moves step by step away from the safety-aligned region as more harmful demonstrations are added. Theoretically, we show that this drift can be interpreted as implicit malicious fine-tuning: conditioning on N harmful demonstrations induces SGD-style updates equivalent to optimizing on the corresponding N harmful samples. This view turns the attack mechanism into a defense principle. We append a fixed one-shot safety demonstration at inference time, which induces a counteracting safety-oriented update and restores refusal behavior. The resulting method improves the model's robustness to MSJ without modifying its parameters or requiring white-box access at deployment. Code is available at https://github.com/Thecommonirin/SafeEnd.
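The defense described in the abstract can be pictured as a small prompt-assembly step. The sketch below is illustrative only and is not taken from the SafeEnd repository: the plain-text `User:`/`Assistant:` format, the names `SAFETY_DEMO`, `format_turn`, and `build_defended_prompt`, the wording of the safety demonstration, and the choice to place it immediately before the final query are all assumptions. All demonstration content is a neutral placeholder.

```python
from typing import List, Tuple

Demo = Tuple[str, str]  # (question, answer)

# Hypothetical placeholder: the fixed demonstration text used by the authors
# is not given in this summary.
SAFETY_DEMO: Demo = (
    "<example of a request the model should decline>",
    "I can't help with that request.",
)


def format_turn(question: str, answer: str) -> str:
    """Render one question-answer pair as plain text (assumed format)."""
    return f"User: {question}\nAssistant: {answer}"


def build_defended_prompt(context_demos: List[Demo], query: str,
                          safety_demo: Demo = SAFETY_DEMO) -> str:
    """Place one fixed safety demonstration after the supplied context.

    Assumption: the demonstration goes after all incoming demonstrations
    and immediately before the final query.
    """
    turns = [format_turn(q, a) for q, a in context_demos]
    turns.append(format_turn(*safety_demo))
    turns.append(f"User: {query}\nAssistant:")
    return "\n\n".join(turns)


# Placeholder many-shot context standing in for N attacker-supplied pairs.
demos = [(f"<demo question {i}>", f"<demo answer {i}>") for i in range(1, 4)]
prompt = build_defended_prompt(demos, "<final query>")
print(prompt)
```

Because the step only edits the input text, it needs no access to model weights or activations, which matches the abstract's claim that the method works without parameter changes or white-box access at deployment.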
Problem

Research questions and friction points this paper is trying to address.

many-shot jailbreak
safety alignment
language models
harmful queries
activation drift
Innovation

Methods, ideas, or system contributions that make the work stand out.

many-shot jailbreaking
activation drift
implicit malicious fine-tuning
one-shot safety demonstration
inference-time defense