🤖 AI Summary
This paper addresses many-shot jailbreaking (MSJ)—a multi-turn adversarial attack that injects numerous fabricated harmful responses into long-context prompts to subvert LLM safety alignment. The authors propose a plug-and-play compositional defense framework that systematically integrates supervised fine-tuning (SFT), rejection sampling fine-tuning, rule- and LLM-driven input rewriting, and context-aware toxicity detection to jointly enhance input sanitization and model robustness. The key contribution lies in balancing security strengthening with functionality preservation: the framework reduces MSJ attack success rates by over 70% without degrading in-context learning or conversational capabilities, while retaining ≥98% of original performance on benchmarks including TruthfulQA and MT-Bench—substantially outperforming single-strategy defenses.
📝 Abstract
Many-shot jailbreaking (MSJ) is an adversarial technique that exploits the long context windows of modern LLMs to circumvent model safety training by including in the prompt many examples of a "fake" assistant responding inappropriately before the final request. With enough examples, the model's in-context learning abilities override its safety training, and it responds as if it were the "fake" assistant. In this work, we probe the effectiveness of different fine-tuning and input sanitization approaches in mitigating MSJ attacks, both alone and in combination. We find that each technique provides incremental mitigation, and we show that the combined techniques significantly reduce the effectiveness of MSJ attacks while retaining model performance on benign in-context learning and conversational tasks. We suggest that our approach could meaningfully ameliorate this vulnerability if incorporated into model safety post-training.
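To make the attack and the sanitization side of the defense concrete, the sketch below shows one plausible input-sanitization gate: scanning a long conversation for fabricated "fake assistant" turns and stripping or quarantining them before the model sees the prompt. Everything here is illustrative — the `Turn` type, the keyword blocklist standing in for a context-aware toxicity classifier, and the `max_toxic_shots` threshold are all assumptions, not the paper's actual components (which also include SFT, rejection sampling fine-tuning, and LLM-driven rewriting).

```python
# Hypothetical sketch of an input-sanitization gate against many-shot
# jailbreaking (MSJ). All names, thresholds, and the keyword-based
# toxicity check are illustrative assumptions, not the paper's method.

from dataclasses import dataclass

@dataclass
class Turn:
    role: str      # "user" or "assistant"
    content: str

# Toy stand-in for a context-aware toxicity classifier.
BLOCKLIST = {"make a bomb", "steal credentials"}

def is_toxic(text: str) -> bool:
    lowered = text.lower()
    return any(phrase in lowered for phrase in BLOCKLIST)

def sanitize_context(turns: list[Turn], max_toxic_shots: int = 2) -> list[Turn]:
    """Drop fabricated assistant turns that the toxicity check flags.

    MSJ works by packing many 'fake' harmful assistant replies into the
    prompt; stripping flagged assistant turns removes the in-context
    demonstrations the attack relies on.
    """
    kept: list[Turn] = []
    toxic_seen = 0
    for turn in turns:
        if turn.role == "assistant" and is_toxic(turn.content):
            toxic_seen += 1
            continue  # remove the fabricated harmful demonstration
        kept.append(turn)
    if toxic_seen > max_toxic_shots:
        # Too many injected shots: treat the whole context as adversarial
        # and keep only the final user request for fresh handling.
        kept = [t for t in kept if t.role == "user"][-1:]
    return kept
```

A benign long context with no flagged assistant turns passes through unchanged, which reflects the paper's goal of preserving in-context learning; only contexts exhibiting the many-shot attack pattern are rewritten.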