Mitigating Many-shot Jailbreak Attacks with One Single Demonstration

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

career value

177K/year

🤖 AI Summary

This work addresses multi-example jailbreaking attacks, wherein a series of harmful question-answer demonstrations are prepended to a malicious query to steer aligned language models toward generating prohibited content. The study provides the first theoretical characterization of this attack as an implicit adversarial fine-tuning process, revealing a progressive activation shift in the model’s representation space. Building on this insight, the authors propose a lightweight, inference-time defense that requires neither model parameter updates nor white-box access: appending a single safe example to the input context suffices to induce a corrective “safety update,” effectively restoring the model’s ability to reject harmful requests. This approach significantly enhances robustness against such attacks while maintaining high efficiency and practical deployability.

📝 Abstract

Many-shot jailbreaking (MSJ) causes safety-aligned language models to answer harmful queries by preceding them with many harmful question-answer demonstrations. We study why this attack becomes stronger as the number of demonstrations increases. Empirically, we find that MSJ induces a progressive activation drift: the representation of a fixed harmful query moves step by step away from the safety-aligned region as more harmful demonstrations are added. Theoretically, we show that this drift can be interpreted as implicit malicious fine-tuning: conditioning on N harmful demonstrations induces SGD-style updates equivalent to optimizing on the corresponding N harmful samples. This view turns the attack mechanism into a defense principle. We append a fixed one-shot safety demonstration at inference time, which induces a counteracting safety-oriented update and restores refusal behavior. The resulting method improves the model's robustness to MSJ without modifying its parameters or requiring white-box access at deployment. Code is available at https://github.com/Thecommonirin/SafeEnd.

Problem

Research questions and friction points this paper is trying to address.

many-shot jailbreak

safety alignment

language models

harmful queries

activation drift

Innovation

Methods, ideas, or system contributions that make the work stand out.

many-shot jailbreaking

activation drift

implicit malicious fine-tuning