Mitigating Many-Shot Jailbreaking

📅 2025-04-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses many-shot jailbreaking (MSJ)—a multi-turn adversarial attack that injects numerous fabricated harmful responses into long-context prompts to subvert LLM safety alignment. We propose a plug-and-play compositional defense framework that systematically integrates supervised fine-tuning (SFT), rejection sampling fine-tuning, rule- and LLM-driven input rewriting, and context-aware toxicity detection to jointly enhance input sanitization and model robustness. Our key contribution lies in achieving balanced security strengthening and functionality preservation: the framework reduces MSJ attack success rates by over 70% without degrading in-context learning or conversational capabilities, while retaining ≥98% of original performance on benchmarks including TruthfulQA and MT-Bench—substantially outperforming single-strategy defenses.

Technology Category

Application Category

📝 Abstract
Many-shot jailbreaking (MSJ) is an adversarial technique that exploits the long context windows of modern LLMs to circumvent model safety training by including in the prompt many examples of a ``fake'' assistant responding inappropriately before the final request. With enough examples, the model's in-context learning abilities override its safety training, and it responds as if it were the ``fake'' assistant. In this work, we probe the effectiveness of different fine tuning and input sanitization approaches on mitigating MSJ attacks, alone and in combination. We find incremental mitigation effectiveness for each, and we show that the combined techniques significantly reduce the effectiveness of MSJ attacks, while retaining model performance in benign in-context learning and conversational tasks. We suggest that our approach could meaningfully ameliorate this vulnerability if incorporated into model safety post-training.
Problem

Research questions and friction points this paper is trying to address.

Mitigating adversarial many-shot jailbreaking attacks on LLMs
Evaluating fine-tuning and input sanitization for safety
Balancing attack resistance with model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines fine tuning and input sanitization
Reduces many-shot jailbreaking effectiveness
Retains model performance in benign tasks
🔎 Similar Papers
No similar papers found.