🤖 AI Summary
This work addresses the limitations of current automated red-teaming approaches for large language models, which often cover only narrow attack scenarios and generate too little data for defense. The authors propose Persona-Conditioned Adversarial Prompting (PCAP), a framework that introduces diverse attacker personas (such as doctors, students, and malicious actors) along with strategy sets to enable parallel, persona-conditioned adversarial search. This approach uncovers jailbreak attacks that transfer across contexts and automatically generates safety fine-tuning data enriched with metadata. By integrating persona conditioning into adversarial prompt generation, PCAP establishes an end-to-end pipeline from vulnerability discovery to automated model alignment. Evaluated on GPT-OSS 120B, the method raises attack success rates from 57% to 97% and increases prompt diversity by 2–6×; after lightweight adapters are fine-tuned on PCAP-generated data, recall improves from 0.36 to 0.99 and F1 from 0.53 to 0.96, with minimal false positives.
📝 Abstract
Automated red-teaming for LLMs often discovers only narrow attack slices, missing diverse real-world threats and yielding insufficient data for safety fine-tuning. We introduce Persona-Conditioned Adversarial Prompting (PCAP), which conditions adversarial search on diverse attacker personas (e.g., doctors, students, malicious actors) and strategy sets to explore realistic attack scenarios. By running parallel persona-conditioned searches, PCAP discovers transferable jailbreaks across different contexts and generates rich defense datasets with automatic metadata tracking. On GPT-OSS 120B, PCAP increases attack success from 57% to 97% while producing 2–6× more diverse prompts covering varied real-world scenarios. Critically, fine-tuning lightweight adapters on PCAP-generated data significantly improves model robustness (recall: 0.36 → 0.99, F1: 0.53 → 0.96) with minimal false positives, demonstrating a practical closed-loop approach from vulnerability discovery to automated alignment.
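The abstract does not specify the search procedure, so the following is only a minimal sketch of the general idea of running one search per persona in parallel and tagging every attempt with metadata. All names here (`Record`, `build_prompt`, `search_persona`, `run_pcap_sketch`, `toy_target`, `toy_judge`) are hypothetical, the prompt template is a placeholder, and the target and judge are toy stand-ins rather than a real model or the authors' implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One attempted prompt plus the metadata needed to build a defense dataset."""
    persona: str
    strategy: str
    prompt: str
    success: bool


def build_prompt(persona: str, strategy: str, seed: str) -> str:
    # Hypothetical template: a real system would have an attacker model write this.
    return f"[persona={persona}] [strategy={strategy}] {seed}"


def toy_target(prompt: str) -> str:
    # Toy stand-in for the model under test (assumption, not a real API).
    weak_spot = "persona=doctor" in prompt and "strategy=role-play" in prompt
    return "COMPLY" if weak_spot else "REFUSE"


def toy_judge(response: str) -> bool:
    # Toy stand-in for an attack-success judge.
    return response == "COMPLY"


def search_persona(persona, strategies, seed, target, judge):
    """Try every strategy for a single persona and record each outcome."""
    records = []
    for strategy in strategies:
        prompt = build_prompt(persona, strategy, seed)
        records.append(Record(persona, strategy, prompt, judge(target(prompt))))
    return records


def run_pcap_sketch(personas, strategies, seed, target, judge):
    """Run one search per persona in parallel; results keep persona order."""
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(search_persona, p, strategies, seed, target, judge)
            for p in personas
        ]
        return [record for future in futures for record in future.result()]
```

Calling `run_pcap_sketch(["doctor", "student"], ["role-play", "hypothetical"], "<seed request>", toy_target, toy_judge)` returns one `Record` per persona and strategy pair. Because each record carries its persona and strategy, successful and failed prompts could later be filtered or labeled for fine-tuning, which is the role the abstract assigns to automatic metadata tracking.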