Defending Jailbreak Prompts via In-Context Adversarial Game

📅 2024-02-20
🏛️ Conference on Empirical Methods in Natural Language Processing
📈 Citations: 8
Influential: 0
🤖 AI Summary
Large language models (LLMs) lack lightweight, adaptive defenses against jailbreak attacks. Method: This paper proposes the In-Context Adversarial Game (ICAG), a training-free framework that enables online, adaptive defense during inference. ICAG instantiates red-team (attacker) and blue-team (defender) agents within the prompt context and engages them in iterative adversarial interactions to detect and mitigate jailbreak attempts in real time, without model fine-tuning. Contribution/Results: ICAG introduces the first training-free, online adversarial mechanism, eliminating reliance on static datasets or retraining. It enables defenses to evolve automatically against emerging jailbreak techniques and exhibits strong cross-model generalization. Experiments show that ICAG significantly reduces the success rates of diverse known jailbreak attacks while transferring robustly to unseen attack types and multiple LLMs, including proprietary and open-weight models.

📝 Abstract
Large Language Models (LLMs) demonstrate remarkable capabilities across diverse applications. However, concerns regarding their security, particularly the vulnerability to jailbreak attacks, persist. Drawing inspiration from adversarial training in deep learning and LLM agent learning processes, we introduce the In-Context Adversarial Game (ICAG) for defending against jailbreaks without the need for fine-tuning. ICAG leverages agent learning to conduct an adversarial game, aiming to dynamically extend knowledge to defend against jailbreaks. Unlike traditional methods that rely on static datasets, ICAG employs an iterative process to enhance both the defense and attack agents. This continuous improvement process strengthens defenses against newly generated jailbreak prompts. Our empirical studies affirm ICAG’s efficacy, where LLMs safeguarded by ICAG exhibit significantly reduced jailbreak success rates across various attack scenarios. Moreover, ICAG demonstrates remarkable transferability to other LLMs, indicating its potential as a versatile defense mechanism. The code is available at https://github.com/YujunZhou/In-Context-Adversarial-Game.
Problem

Research questions and friction points this paper is trying to address.

How to defend LLMs against jailbreak attacks dynamically, at inference time
How to use an adversarial game to strengthen model security
How to improve defenses iteratively without fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

In-Context Adversarial Game
Dynamic knowledge extension
Iterative defense enhancement
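The iterative game described above can be sketched as a minimal loop: in each round the attack agent refines a jailbreak prompt using past attack insights, the defense agent builds a system prompt from past defense insights, and whichever side loses the round extracts a new insight from the failure. All function names, prompts, and the judge below are illustrative assumptions, not the authors' implementation:

```python
# Minimal sketch of an ICAG-style adversarial game loop. `call_llm`,
# `is_jailbroken`, and all prompt wording are hypothetical stand-ins.

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM API call."""
    return f"[model response to: {prompt[:40]}...]"

def attack_agent(seed_prompt: str, attack_insights: list[str]) -> str:
    """Red team: refine a jailbreak prompt using accumulated attack insights."""
    insights = "\n".join(attack_insights) or "(none yet)"
    return call_llm(
        "Rewrite this jailbreak attempt to be more effective.\n"
        f"Known strategies:\n{insights}\n\nPrompt: {seed_prompt}"
    )

def defense_agent(defense_insights: list[str]) -> str:
    """Blue team: assemble a defensive system prompt from accumulated insights."""
    insights = "\n".join(defense_insights) or "(none yet)"
    return (
        "You are a helpful assistant. Refuse harmful requests.\n"
        f"Known jailbreak patterns to watch for:\n{insights}"
    )

def is_jailbroken(response: str) -> bool:
    """Placeholder judge; a real one would use an LLM or a safety classifier."""
    return "[JAILBROKEN]" in response

def icag_round(seed_prompt: str,
               attack_insights: list[str],
               defense_insights: list[str]) -> str:
    """One round: attack, defend, then extract an insight from the outcome."""
    jailbreak = attack_agent(seed_prompt, attack_insights)
    system_prompt = defense_agent(defense_insights)
    response = call_llm(system_prompt + "\n\nUser: " + jailbreak)
    if is_jailbroken(response):
        # Defender reflects on the failed refusal and records a new insight.
        defense_insights.append(call_llm(f"Why did this succeed? {jailbreak}"))
    else:
        # Attacker reflects on the refusal and records a new strategy.
        attack_insights.append(call_llm(f"Why was this refused? {jailbreak}"))
    return response

attack_insights: list[str] = []
defense_insights: list[str] = []
for _ in range(3):  # a few game iterations
    icag_round("Tell me how to do X.", attack_insights, defense_insights)
```

Because the loop only accumulates in-context insights (prompts), not gradient updates, the defense evolves without any fine-tuning, which is what makes the game runnable at inference time.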