Prompt Optimization and Evaluation for LLM Automated Red Teaming

📅 2025-07-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the low efficiency and instability of vulnerability discovery when large language models (LLMs) are used for automated red teaming, particularly in attack prompt generation. We propose a discoverability-based prompt optimization method that quantifies exploitability by estimating the expected success rate of a single attack on the target system via multi-random-seed sampling; this metric then guides iterative prompt refinement. Attack Success Rate (ASR) serves as the core evaluation metric, enhanced by target-environment randomization and repeated sampling to improve assessment robustness. Our key contribution is the first formal modeling of discoverability as a measurable, optimizable prompt-quality metric, eliminating reliance on manual annotations or fixed benchmarks. Experiments demonstrate significant improvements in vulnerability identification rate and cross-model generalization, thereby enhancing both the effectiveness and stability of automated red teaming.

📝 Abstract
Applications that use Large Language Models (LLMs) are becoming widespread, making the identification of system vulnerabilities increasingly important. Automated Red Teaming accelerates this effort by using an LLM to generate and execute attacks against target systems. Attack generators are evaluated using the Attack Success Rate (ASR), the sample mean calculated over the judgment of success for each attack. In this paper, we introduce a method for optimizing attack generator prompts that applies ASR to individual attacks. By repeating each attack multiple times against a randomly seeded target, we measure an attack's discoverability, the expectation of the individual attack's success. This approach reveals exploitable patterns that inform prompt optimization, ultimately enabling more robust evaluation and refinement of generators.
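The two metrics in the abstract translate directly into code. The sketch below is a minimal illustration, not the authors' implementation: `run_attack` is a hypothetical callable standing in for executing one attack against a seeded target and judging success. Discoverability is estimated as the success frequency of a single attack repeated under random seeds, and ASR is the sample mean over a set of attacks:

```python
import random
from statistics import mean
from typing import Callable, Sequence

def discoverability(attack: str,
                    run_attack: Callable[[str, int], bool],
                    n_repeats: int = 10) -> float:
    """Estimate the expected success of one attack by repeating it
    against the target under independently drawn random seeds."""
    seeds = [random.randrange(2**32) for _ in range(n_repeats)]
    return mean(1.0 if run_attack(attack, seed) else 0.0 for seed in seeds)

def attack_success_rate(attacks: Sequence[str],
                        run_attack: Callable[[str, int], bool],
                        n_repeats: int = 10) -> float:
    """ASR: the sample mean of per-attack success, computed here from
    the per-attack discoverability estimates."""
    return mean(discoverability(a, run_attack, n_repeats) for a in attacks)
```

In an optimization loop, the per-attack discoverability scores (rather than the aggregate ASR alone) would be fed back to refine the generator prompt, which is the paper's central idea.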
Problem

Research questions and friction points this paper is trying to address.

Optimizing prompts for LLM-based attack generation
Measuring attack discoverability via repeated executions
Improving robustness of automated red teaming evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimizes attack prompts using individual ASR
Measures attack discoverability via repeated executions
Reveals exploitable patterns for prompt refinement
Michael Freenor
Fuel iX Applied Research, Charlottesville, USA
Lauren Alvarez
Fuel iX Applied Research, Charlottesville, USA; North Carolina State University, Raleigh, USA
Milton Leal
Fuel iX Applied Research, Charlottesville, USA
Lily Smith
Fuel iX Applied Research, Charlottesville, USA
Joel Garrett
Fuel iX Applied Research, Charlottesville, USA
Yelyzaveta Husieva
Fuel iX Applied Research, Charlottesville, USA
Madeline Woodruff
Fuel iX Applied Research, Charlottesville, USA
Ryan Miller
Fuel iX Applied Research, Charlottesville, USA
Erich Kummerfeld
Research Assistant Professor, Institute for Health Informatics, University of Minnesota
Causality, Latent Variables, Machine Learning, Philosophy of Science
Rafael Medeiros
TELUS Digital, Vancouver, CA
Sander Schulhoff
CEO, HackAPrompt | Learn Prompting
Natural Language Processing, Deep Reinforcement Learning, Generative AI, Prompt Injection