Defending Jailbreak Prompts via In-Context Adversarial Game

📅 2024-02-20
🏛️ Conference on Empirical Methods in Natural Language Processing
📈 Citations: 8
Influential: 0
🤖 AI Summary
Large language models (LLMs) lack lightweight, adaptive defenses against jailbreak attacks. Method: This paper proposes the In-Context Adversarial Game (ICAG), a training-free framework that enables online, adaptive defense during inference. ICAG instantiates red-team (attacker) and blue-team (defender) agents within the prompt context and engages them in iterative adversarial interactions to detect and mitigate jailbreak attempts in real time, without model fine-tuning. Contribution/Results: ICAG introduces the first training-free, online adversarial mechanism, eliminating reliance on static datasets or retraining. It enables defenses to evolve automatically against emerging jailbreak techniques and exhibits strong cross-model generalization. Experiments show that ICAG significantly reduces the success rates of diverse known jailbreak attacks while transferring robustly to unseen attack types and multiple LLMs, including proprietary and open-weight models.

📝 Abstract
Large Language Models (LLMs) demonstrate remarkable capabilities across diverse applications. However, concerns regarding their security, particularly the vulnerability to jailbreak attacks, persist. Drawing inspiration from adversarial training in deep learning and LLM agent learning processes, we introduce the In-Context Adversarial Game (ICAG) for defending against jailbreaks without the need for fine-tuning. ICAG leverages agent learning to conduct an adversarial game, aiming to dynamically extend knowledge to defend against jailbreaks. Unlike traditional methods that rely on static datasets, ICAG employs an iterative process to enhance both the defense and attack agents. This continuous improvement process strengthens defenses against newly generated jailbreak prompts. Our empirical studies affirm ICAG’s efficacy, where LLMs safeguarded by ICAG exhibit significantly reduced jailbreak success rates across various attack scenarios. Moreover, ICAG demonstrates remarkable transferability to other LLMs, indicating its potential as a versatile defense mechanism. The code is available at https://github.com/YujunZhou/In-Context-Adversarial-Game.
Problem

Research questions and friction points this paper is trying to address.

How to defend LLMs against jailbreak attacks dynamically, at inference time
How to use an adversarial game to strengthen model security
How to improve defenses iteratively without fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

In-Context Adversarial Game
Dynamic knowledge extension
Iterative defense enhancement
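The iterative game described above can be sketched as a minimal loop: in each round the attack agent refines a jailbreak prompt using past attack insights, the defense agent builds a system prompt from past defense insights, and whichever side loses the round extracts a new insight from the failure. All function names, prompts, and the judge below are illustrative assumptions, not the authors' implementation:

```python
# Minimal sketch of an ICAG-style adversarial game loop. `call_llm`,
# `is_jailbroken`, and all prompt wording are hypothetical stand-ins.

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM API call."""
    return f"[model response to: {prompt[:40]}...]"

def attack_agent(seed_prompt: str, attack_insights: list[str]) -> str:
    """Red team: refine a jailbreak prompt using accumulated attack insights."""
    insights = "\n".join(attack_insights) or "(none yet)"
    return call_llm(
        "Rewrite this jailbreak attempt to be more effective.\n"
        f"Known strategies:\n{insights}\n\nPrompt: {seed_prompt}"
    )

def defense_agent(defense_insights: list[str]) -> str:
    """Blue team: assemble a defensive system prompt from accumulated insights."""
    insights = "\n".join(defense_insights) or "(none yet)"
    return (
        "You are a helpful assistant. Refuse harmful requests.\n"
        f"Known jailbreak patterns to watch for:\n{insights}"
    )

def is_jailbroken(response: str) -> bool:
    """Placeholder judge; a real one would use an LLM or a safety classifier."""
    return "[JAILBROKEN]" in response

def icag_round(seed_prompt: str,
               attack_insights: list[str],
               defense_insights: list[str]) -> str:
    """One round: attack, defend, then extract an insight from the outcome."""
    jailbreak = attack_agent(seed_prompt, attack_insights)
    system_prompt = defense_agent(defense_insights)
    response = call_llm(system_prompt + "\n\nUser: " + jailbreak)
    if is_jailbroken(response):
        # Defender reflects on the failed refusal and records a new insight.
        defense_insights.append(call_llm(f"Why did this succeed? {jailbreak}"))
    else:
        # Attacker reflects on the refusal and records a new strategy.
        attack_insights.append(call_llm(f"Why was this refused? {jailbreak}"))
    return response

attack_insights: list[str] = []
defense_insights: list[str] = []
for _ in range(3):  # a few game iterations
    icag_round("Tell me how to do X.", attack_insights, defense_insights)
```

Because the loop only accumulates in-context insights (prompts), not gradient updates, the defense evolves without any fine-tuning, which is what makes the game runnable at inference time.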