AutoRISE: Agent-Driven Strategy Evolution for Red-Teaming Large Language Models

📅 2026-04-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing automated red-teaming approaches, which optimize attack prompts within a fixed, human-designed strategy and leave the strategy itself unchanged. The authors propose AutoRISE, an agent-driven, program-level search framework in which a coding agent iteratively edits executable attack programs, rather than individual prompts, in a gradient-free, black-box setting. A fixed evaluation harness scores each edited program and returns a scalar objective plus per-example diagnostics that guide subsequent edits. This permits structural changes, such as new attack components and altered control flow, that prompt-level optimization does not directly express, and it requires no fine-tuning, human annotation, or GPU compute. Evaluated on 11 models from five families against seven established jailbreak datasets, AutoRISE improves average attack success rate on held-out models by 17.0 percentage points over the strongest baseline, and by up to 16 points on frontier targets with low baseline success rates.

📝 Abstract

Automated red-teaming methods for large language models typically optimize attack prompts within a fixed, human-designed strategy, leaving the attack strategy itself unchanged. We instead optimize the strategy. We propose AutoRISE, a method that searches over executable attack programs rather than individual prompts. At each iteration, a coding agent edits a strategy and a fixed evaluation harness scores the resulting attacks, returning both a scalar objective and per-example diagnostics that guide subsequent edits. This allows structural changes, including new attack components and altered control flow, that prompt-level methods do not directly express. We also release two benchmark suites developed on disjoint target sets and evaluate on 11 models from five families against seven established jailbreak datasets. Across held-out models, AutoRISE improves average attack success rate by 17.0 points over the strongest baseline, and improves attack success by up to 16 points on frontier targets with low baseline success rates. Ablations against parametric and strategy-library baselines suggest that these gains arise from unrestricted program search, particularly compositional techniques and control-flow edits. AutoRISE operates in a black-box, inference-only setting, requiring no fine-tuning, human annotation, or GPU compute.
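The edit-and-score loop described in the abstract can be illustrated with a deliberately abstract toy. The sketch below is not the authors' implementation: the names `Report`, `evaluate`, `propose_edit`, and `search` are hypothetical, the "strategy" is reduced to a list of placeholder step names instead of an executable program, the harness is a trivial membership check, and the greedy keep-if-better acceptance rule is an assumption. It only shows the shape of the feedback loop, in which a fixed harness returns a scalar objective plus per-example diagnostics and the editor uses those diagnostics to choose its next change.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

# Assumption: a strategy is modelled as a list of placeholder step names.
Strategy = List[str]


@dataclass
class Report:
    score: float             # scalar objective: fraction of cases that succeed
    diagnostics: List[bool]  # per-example outcome, in the same order as the cases


def evaluate(strategy: Strategy, cases: List[str]) -> Report:
    """Fixed toy harness: a case succeeds if the strategy contains the step it needs."""
    outcomes = [required in strategy for required in cases]
    return Report(score=sum(outcomes) / len(outcomes), diagnostics=outcomes)


def propose_edit(strategy: Strategy, report: Report, cases: List[str],
                 rng: random.Random) -> Strategy:
    """Stand-in for the coding agent: read the diagnostics and add one missing step."""
    failing = [case for case, ok in zip(cases, report.diagnostics) if not ok]
    if not failing:
        return list(strategy)  # nothing left to fix
    return strategy + [rng.choice(failing)]


def search(initial: Strategy, cases: List[str], iterations: int,
           seed: int = 0) -> Tuple[Strategy, Report]:
    """Gradient-free loop: edit, score with the fixed harness, keep strict improvements."""
    rng = random.Random(seed)
    best = list(initial)
    best_report = evaluate(best, cases)
    for _ in range(iterations):
        candidate = propose_edit(best, best_report, cases, rng)
        report = evaluate(candidate, cases)
        if report.score > best_report.score:  # assumed greedy acceptance rule
            best, best_report = candidate, report
    return best, best_report


if __name__ == "__main__":
    cases = ["step_a", "step_b", "step_c", "step_d"]
    best, report = search(["step_a"], cases, iterations=5)
    print(best, report.score)
```

In this toy the editor can only append steps; the abstract's point is that searching over full programs also allows edits to control flow, which a fixed-parameter search space like this one cannot express.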

Problem

Research questions and friction points this paper is trying to address.

red-teaming

attack strategy

large language models

prompt optimization

jailbreak

Innovation

Methods, ideas, or system contributions that make the work stand out.

strategy evolution

executable attack programs

agent-driven red-teaming