🤖 AI Summary
Iterative jailbreaking attacks, in which adversaries repeatedly rewrite prompts by trial and error to elicit harmful outputs from large language models (LLMs), exploit a dynamic feedback loop that existing defenses fail to actively disrupt. Method: We propose the first online-learning-based dynamic prompt optimization framework for jailbreak defense, integrating reinforcement learning with Past-Direction Gradient Damping (PDGD). The framework models the discriminative boundary between harmful and harmless prompts in real time during inference, suppresses local overfitting to the attacker's partial rewrites, and lets the defense policy evolve adaptively. Results: Evaluated on three mainstream LLMs, the method significantly outperforms five baseline defenses across five representative iterative jailbreak attacks, demonstrating superior robustness while simultaneously improving response quality on benign tasks. Its core innovation is pioneering the application of online learning to jailbreak defense, shifting from passive filtering to proactive, context-aware guidance.
📝 Abstract
Iterative jailbreak methods, which repeatedly rewrite and resubmit prompts to large language models (LLMs) -- using the model's previous responses to guide each new iteration -- have proven to be a highly effective attack strategy. Despite their effectiveness against LLMs and their safety mechanisms, existing defenses do not proactively disrupt this dynamic trial-and-error cycle. In this study, we propose a novel framework that dynamically updates its defense strategy through online learning in response to each new prompt from an iterative jailbreak method. Leveraging the distinctions between harmful jailbreak-generated prompts and typical harmless prompts, we introduce a reinforcement learning-based approach that optimizes prompts so that the model responds appropriately to harmless tasks while explicitly rejecting harmful ones. Additionally, to curb overfitting to the narrow band of partial input rewrites explored during an attack, we introduce Past-Direction Gradient Damping (PDGD). Experiments on three LLMs show that our approach significantly outperforms five existing defense methods against five iterative jailbreak methods. Moreover, our results indicate that our prompt optimization strategy simultaneously enhances response quality on harmless tasks.
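The abstract names PDGD but does not specify its update rule. One plausible reading, sketched below, is that PDGD damps the component of the current gradient that is aligned with a running (exponential-moving-average) direction of past updates, so repeated near-identical attack rewrites cannot drag the defense policy far in one direction. The function name `pdgd_step`, the `damping` and `momentum` parameters, and the EMA formulation are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def pdgd_step(theta, grad, past_dir, lr=0.1, damping=0.9, momentum=0.9):
    """One parameter update with a sketch of Past-Direction Gradient Damping.

    The component of `grad` parallel to the normalized running direction of
    past updates (`past_dir`) is scaled down by `damping`; the orthogonal
    component passes through unchanged. This curbs overfitting to the narrow
    band of rewrites an iterative attacker explores.
    """
    norm = np.linalg.norm(past_dir)
    if norm > 0.0:
        unit = past_dir / norm
        parallel = np.dot(grad, unit) * unit   # component along past directions
        grad = grad - damping * parallel       # damp only that component
    # Track past update directions with an exponential moving average.
    new_past_dir = momentum * past_dir + (1.0 - momentum) * grad
    return theta - lr * grad, new_past_dir
```

With `damping=0.9`, a gradient fully aligned with the past direction produces a step only one tenth as large as an orthogonal gradient of the same magnitude, so genuinely new (e.g. benign) inputs still move the policy freely.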