🤖 AI Summary
This paper addresses many-shot jailbreaking (MSJ)—a multi-turn adversarial attack that injects numerous fabricated harmful responses into long-context prompts to subvert LLM safety alignment. The authors propose a plug-and-play compositional defense framework that systematically integrates supervised fine-tuning (SFT), rejection sampling fine-tuning, rule- and LLM-driven input rewriting, and context-aware toxicity detection to jointly enhance input sanitization and model robustness. The key contribution lies in balancing security strengthening with functionality preservation: the framework reduces MSJ attack success rates by over 70% without degrading in-context learning or conversational capabilities, while retaining ≥98% of original performance on benchmarks including TruthfulQA and MT-Bench—substantially outperforming single-strategy defenses.
📝 Abstract
Many-shot jailbreaking (MSJ) is an adversarial technique that exploits the long context windows of modern LLMs to circumvent model safety training by including in the prompt many examples of a "fake" assistant responding inappropriately before the final request. With enough examples, the model's in-context learning abilities override its safety training, and it responds as if it were the "fake" assistant. In this work, we probe the effectiveness of different fine-tuning and input sanitization approaches in mitigating MSJ attacks, both alone and in combination. We find that each technique provides incremental mitigation, and we show that the combined techniques significantly reduce the effectiveness of MSJ attacks while retaining model performance on benign in-context learning and conversational tasks. We suggest that our approach could meaningfully ameliorate this vulnerability if incorporated into model safety post-training.
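To make the attack and the sanitization side of the defense concrete, the sketch below shows one plausible input-sanitization gate: scanning a long conversation for fabricated "fake assistant" turns and stripping or quarantining them before the model sees the prompt. Everything here is illustrative — the `Turn` type, the keyword blocklist standing in for a context-aware toxicity classifier, and the `max_toxic_shots` threshold are all assumptions, not the paper's actual components (which also include SFT, rejection sampling fine-tuning, and LLM-driven rewriting).

```python
# Hypothetical sketch of an input-sanitization gate against many-shot
# jailbreaking (MSJ). All names, thresholds, and the keyword-based
# toxicity check are illustrative assumptions, not the paper's method.

from dataclasses import dataclass

@dataclass
class Turn:
    role: str      # "user" or "assistant"
    content: str

# Toy stand-in for a context-aware toxicity classifier.
BLOCKLIST = {"make a bomb", "steal credentials"}

def is_toxic(text: str) -> bool:
    lowered = text.lower()
    return any(phrase in lowered for phrase in BLOCKLIST)

def sanitize_context(turns: list[Turn], max_toxic_shots: int = 2) -> list[Turn]:
    """Drop fabricated assistant turns that the toxicity check flags.

    MSJ works by packing many 'fake' harmful assistant replies into the
    prompt; stripping flagged assistant turns removes the in-context
    demonstrations the attack relies on.
    """
    kept: list[Turn] = []
    toxic_seen = 0
    for turn in turns:
        if turn.role == "assistant" and is_toxic(turn.content):
            toxic_seen += 1
            continue  # remove the fabricated harmful demonstration
        kept.append(turn)
    if toxic_seen > max_toxic_shots:
        # Too many injected shots: treat the whole context as adversarial
        # and keep only the final user request for fresh handling.
        kept = [t for t in kept if t.role == "user"][-1:]
    return kept
```

A benign long context with no flagged assistant turns passes through unchanged, which reflects the paper's goal of preserving in-context learning; only contexts exhibiting the many-shot attack pattern are rewritten.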