Evolving Skill-Structured Attack Memory Enhances LLM Jailbreaking

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing black-box jailbreak attacks: they lack a systematic model of attack experience and struggle to generate adversarial prompts efficiently. The authors propose MemoAttack, a framework that formalizes attack experience as skill-structured memory units. MemoAttack combines an evidence-driven memory lifecycle with contextual Thompson Sampling, so that memory evolves over time and is reused selectively. On AdvBench, the method reaches an average attack success rate of 98.00%, which is 16.67 percentage points above the strongest baseline, while reducing the request count by 45.9%. Its performance also continues to improve as memory accumulates over more samples.
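The selection idea mentioned in this summary can be illustrated with a generic bandit sketch. The code below is not the authors' implementation: it is a simplified per-context Beta-Bernoulli Thompson sampler, and the class name `ThompsonSelector`, the opaque unit IDs, the string context labels, and the uniform Beta(1, 1) prior are all assumptions made for illustration. The paper's contextual formulation may differ.

```python
import random
from collections import defaultdict


class ThompsonSelector:
    """Hypothetical sketch: per-context Beta-Bernoulli Thompson sampling.

    Memory units are represented only by opaque IDs; nothing here is
    taken from the paper beyond the general idea of Thompson Sampling.
    """

    def __init__(self, unit_ids, seed=None):
        self.unit_ids = list(unit_ids)
        self.rng = random.Random(seed)
        # counts[(context, unit_id)] = [successes, failures]
        self.counts = defaultdict(lambda: [0, 0])

    def select(self, context):
        """Draw one sample per unit from its posterior; return the argmax."""
        best_id, best_draw = None, -1.0
        for uid in self.unit_ids:
            successes, failures = self.counts[(context, uid)]
            # Beta(1 + s, 1 + f) is the posterior under a uniform prior.
            draw = self.rng.betavariate(1 + successes, 1 + failures)
            if draw > best_draw:
                best_id, best_draw = uid, draw
        return best_id

    def update(self, context, unit_id, success):
        """Record one observed outcome for a unit in a given context."""
        self.counts[(context, unit_id)][0 if success else 1] += 1
```

Units with little evidence keep wide posteriors and are still sampled occasionally, which is the exploration half of the trade-off; units with many recorded successes in a context are picked most of the time, which is the exploitation half.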

📝 Abstract

Jailbreak attacks on large language models (LLMs) aim to induce LLMs to produce content that they are expected to refuse. Automated black-box jailbreak generation is especially important for safety evaluation, where the attacker observes only model outputs and needs to automatically search for effective adversarial prompts. Existing black-box jailbreak methods either depend on sample-wise heuristic search or leverage attack experience by accumulating strategy pools or method libraries; in both cases they lack systematic organization and management of attack experience. To mitigate these drawbacks, we propose MemoAttack, a memory-driven black-box jailbreak framework with comprehensive attack memory modeling, evolution, and selection. Specifically, MemoAttack comprises three key designs: (1) Skill-Structured Memory Modeling, which abstracts accumulated attack experience into reusable skill-structured attack memory whose units pair attack skills with templates, evidence, and lifecycle state; (2) Lifecycle-Driven Memory Evolution, which evolves the memory through evidence-based probation, promotion, retirement, reactivation, elimination, and storage cleanup; and (3) Explore-Exploit Balanced Memory Selection, which balances reliable memory reuse with uncertainty-driven exploration via contextual Thompson Sampling. Experiments on AdvBench demonstrate that MemoAttack achieves an average attack success rate of 98.00%, outperforming the strongest baseline by 16.67 percentage points, while reducing request count by 45.9%. Moreover, MemoAttack continuously improves as memory accumulates over more samples.
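The lifecycle described in design (2) can be read as a small state machine over memory units. The sketch below is one possible reading under stated assumptions, not the paper's rules: the state names follow the abstract, but the `MemoryUnit` fields, the `record` and `cleanup` helpers, and every threshold (`MIN_TRIALS`, `PROMOTE_RATE`, `RETIRE_RATE`, `ELIMINATE_TRIALS`) are hypothetical. Templates are referenced only by an opaque ID.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    PROBATION = "probation"
    ACTIVE = "active"
    RETIRED = "retired"
    ELIMINATED = "eliminated"


@dataclass
class MemoryUnit:
    skill: str          # opaque skill label (hypothetical field)
    template_id: str    # opaque reference to a stored template
    successes: int = 0  # evidence: recorded successes
    failures: int = 0   # evidence: recorded failures
    state: State = State.PROBATION

    @property
    def trials(self) -> int:
        return self.successes + self.failures

    @property
    def rate(self) -> float:
        return self.successes / self.trials if self.trials else 0.0


# Hypothetical thresholds; the abstract does not state any values.
MIN_TRIALS = 5         # evidence needed before any transition
PROMOTE_RATE = 0.6     # success rate for promotion or reactivation
RETIRE_RATE = 0.3      # success rate below which a unit is retired
ELIMINATE_TRIALS = 20  # evidence needed before elimination


def record(unit: MemoryUnit, success: bool) -> State:
    """Add one piece of evidence and apply at most one transition."""
    if unit.state is State.ELIMINATED:
        return unit.state
    if success:
        unit.successes += 1
    else:
        unit.failures += 1
    if unit.trials < MIN_TRIALS:
        return unit.state
    if unit.state is State.PROBATION and unit.rate >= PROMOTE_RATE:
        unit.state = State.ACTIVE       # promotion
    elif unit.state is State.ACTIVE and unit.rate < RETIRE_RATE:
        unit.state = State.RETIRED      # retirement
    elif unit.state is State.RETIRED and unit.rate >= PROMOTE_RATE:
        unit.state = State.ACTIVE       # reactivation
    elif (unit.state in (State.PROBATION, State.RETIRED)
          and unit.trials >= ELIMINATE_TRIALS
          and unit.rate < RETIRE_RATE):
        unit.state = State.ELIMINATED   # elimination
    return unit.state


def cleanup(units):
    """Storage cleanup: drop eliminated units from the store."""
    return [u for u in units if u.state is not State.ELIMINATED]
```

Keeping the evidence counters on the unit itself means every transition can be justified from recorded outcomes, which is the sense in which this sketch is evidence-based.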

Problem

Research questions and friction points this paper is trying to address.

jailbreak attacks

black-box attack

attack memory

large language models

adversarial prompts

Innovation

Methods, ideas, or system contributions that make the work stand out.

Skill-Structured Memory

Memory Evolution

Black-Box Jailbreaking