๐ค AI Summary
This work addresses multi-example jailbreaking attacks, wherein a series of harmful question-answer demonstrations are prepended to a malicious query to steer aligned language models toward generating prohibited content. The study provides the first theoretical characterization of this attack as an implicit adversarial fine-tuning process, revealing a progressive activation shift in the modelโs representation space. Building on this insight, the authors propose a lightweight, inference-time defense that requires neither model parameter updates nor white-box access: appending a single safe example to the input context suffices to induce a corrective โsafety update,โ effectively restoring the modelโs ability to reject harmful requests. This approach significantly enhances robustness against such attacks while maintaining high efficiency and practical deployability.
๐ Abstract
Many-shot jailbreaking (MSJ) causes safety-aligned language models to answer harmful queries by preceding them with many harmful question-answer demonstrations. We study why this attack becomes stronger as the number of demonstrations increases. Empirically, we find that MSJ induces a progressive activation drift: the representation of a fixed harmful query moves step by step away from the safety-aligned region as more harmful demonstrations are added. Theoretically, we show that this drift can be interpreted as implicit malicious fine-tuning: conditioning on N harmful demonstrations induces SGD-style updates equivalent to optimizing on the corresponding N harmful samples. This view turns the attack mechanism into a defense principle. We append a fixed one-shot safety demonstration at inference time, which induces a counteracting safety-oriented update and restores refusal behavior. The resulting method improves the model's robustness to MSJ without modifying its parameters or requiring white-box access at deployment. Code is available at https://github.com/Thecommonirin/SafeEnd.