AutoRISE: Agent-Driven Strategy Evolution for Red-Teaming Large Language Models

📅 2026-04-23

🤖 AI Summary
This work addresses a limitation of existing automated red-teaming approaches: they optimize attack prompts within a fixed, human-designed strategy and leave the strategy itself unchanged. The authors propose AutoRISE, an agent-driven, program-level search framework in which a coding agent iteratively edits executable attack programs rather than individual prompts, and a fixed evaluation harness returns a scalar objective plus per-example diagnostics to guide the next edit. This permits structural changes, such as adding attack components and modifying control flow, that prompt-level optimization does not directly express. The method runs in a black-box, inference-only setting and requires no fine-tuning, human annotation, or GPU compute. Evaluated on 11 models from five families against seven established jailbreak datasets, AutoRISE improves average attack success rate on held-out models by 17.0 points over the strongest baseline, with gains of up to 16 points on frontier targets that have low baseline success rates.
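The gains above are reported in points of attack success rate, which is an absolute difference and not a relative one. A minimal sketch of that arithmetic follows; the function names and the outcome counts are hypothetical illustrations, not figures from the paper.

```python
from typing import Sequence


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Percentage of attempts judged successful (0 to 100)."""
    return 100.0 * sum(outcomes) / len(outcomes)


def point_gain(baseline_asr: float, method_asr: float) -> float:
    """Absolute difference in percentage points."""
    return method_asr - baseline_asr


def relative_gain(baseline_asr: float, method_asr: float) -> float:
    """Relative change in percent, shown only for contrast."""
    return 100.0 * (method_asr - baseline_asr) / baseline_asr


# Hypothetical outcomes: 8 of 40 successes for a baseline, 14 of 40 for a method.
baseline = attack_success_rate([True] * 8 + [False] * 32)
method = attack_success_rate([True] * 14 + [False] * 26)

print(point_gain(baseline, method))     # gain in percentage points
print(relative_gain(baseline, method))  # relative gain in percent
```

With these made-up counts the same change is 15 points in absolute terms but 75% in relative terms, which is why the unit matters when reading the reported numbers.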

📝 Abstract
Automated red-teaming methods for large language models typically optimize attack prompts within a fixed, human-designed strategy, leaving the attack strategy itself unchanged. We instead optimize the strategy. We propose AutoRISE, a method that searches over executable attack programs rather than individual prompts. At each iteration, a coding agent edits a strategy and a fixed evaluation harness scores the resulting attacks, returning both a scalar objective and per-example diagnostics that guide subsequent edits. This allows structural changes, including new attack components and altered control flow, that prompt-level methods do not directly express. We also release two benchmark suites developed on disjoint target sets and evaluate on 11 models from five families against seven established jailbreak datasets. Across held-out models, AutoRISE improves average attack success rate by 17.0 points over the strongest baseline, and improves attack success by up to 16 points on frontier targets with low baseline success rates. Ablations against parametric and strategy-library baselines suggest that these gains arise from unrestricted program search, particularly compositional techniques and control-flow edits. AutoRISE operates in a black-box, inference-only setting, requiring no fine-tuning, human annotation, or GPU compute.
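The edit-and-evaluate loop described in the abstract can be sketched in a few lines. This is only an illustration under stated assumptions: the task is a deliberately benign stand-in (text normalization), the "agent" is a random-edit stub rather than a coding agent, candidates are accepted greedily, and every name here (`STEPS`, `harness`, `propose_edit`, `search`) is hypothetical and does not come from the paper.

```python
import random
from typing import Callable, Dict, List, Tuple

# A "strategy" is an ordered list of step names: a toy stand-in for an
# executable program whose structure can be edited.
Strategy = List[str]

STEPS: Dict[str, Callable[[str], str]] = {
    "strip": str.strip,
    "lower": str.lower,
    "dedupe_spaces": lambda s: " ".join(s.split()),
}

EXAMPLES = [("  Hello   World ", "hello world"), ("FOO", "foo"), ("a  b", "a b")]


def run(strategy: Strategy, text: str) -> str:
    """Execute the strategy by applying its steps in order."""
    for name in strategy:
        text = STEPS[name](text)
    return text


def harness(strategy: Strategy, examples) -> Tuple[float, List[dict]]:
    """Fixed evaluator: returns a scalar score and per-example diagnostics."""
    diagnostics = []
    for inp, target in examples:
        out = run(strategy, inp)
        diagnostics.append({"input": inp, "output": out, "ok": out == target})
    score = sum(d["ok"] for d in diagnostics) / len(diagnostics)
    return score, diagnostics


def propose_edit(strategy: Strategy, diagnostics: List[dict],
                 rng: random.Random) -> Strategy:
    """Stub for the editing agent. A real agent would read the diagnostics;
    this stub ignores them and makes a random structural edit."""
    candidate = list(strategy)
    if candidate and rng.random() < 0.3:
        candidate.pop(rng.randrange(len(candidate)))  # remove a step
    else:
        position = rng.randint(0, len(candidate))     # add a step
        candidate.insert(position, rng.choice(list(STEPS)))
    return candidate


def search(examples, iterations: int = 50, seed: int = 0):
    """Greedy loop (an assumption): keep an edit only if the score improves."""
    rng = random.Random(seed)
    best: Strategy = []
    best_score, diagnostics = harness(best, examples)
    for _ in range(iterations):
        candidate = propose_edit(best, diagnostics, rng)
        score, cand_diag = harness(candidate, examples)
        if score > best_score:
            best, best_score, diagnostics = candidate, score, cand_diag
    return best, best_score


best, best_score = search(EXAMPLES)
print(best, best_score)
```

The point of the sketch is the division of labor: the harness stays fixed while the search edits the structure of the program itself, not a single input to it.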
Problem

Research questions and friction points this paper is trying to address.

red-teaming
attack strategy
large language models
prompt optimization
jailbreak
Innovation

Methods, ideas, or system contributions that make the work stand out.

strategy evolution
executable attack programs
agent-driven red-teaming
black-box jailbreak
compositional program search