PersonaTeaming: Exploring How Introducing Personas Can Improve Automated AI Red-Teaming

📅 2025-09-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing automated red-teaming approaches overlook how a red-teamer's identity and background shape attack strategies, limiting the diversity of risks they surface. This paper proposes PersonaTeaming, a framework that integrates identity-specific characteristics, such as "red-teaming expert" versus "regular AI user" personas, into automated red-teaming pipelines. It introduces a dynamic persona-generation algorithm that adapts persona types to different seed prompts, and formalizes "mutation distance" metrics to quantify how far adversarial prompts evolve from their seeds. By coupling persona-driven prompt mutation with existing diversity measures, PersonaTeaming improves attack efficacy: experiments across multiple AI models show up to a 144.1% improvement in attack success rate over RainbowPlus, a state-of-the-art automated red-teaming method, while maintaining prompt diversity.
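The persona-conditioned mutation step can be sketched roughly as follows. The persona templates, the `build_mutation_prompt` helper, and the way personas are injected are illustrative assumptions for this sketch, not the paper's actual implementation (which mutates prompts via a persona-conditioned model call):

```python
# Illustrative sketch of persona-conditioned prompt mutation.
# The two persona types mirror the paper's "red-teaming expert" vs.
# "regular AI user" distinction; the template wording is assumed.

PERSONA_TEMPLATES = {
    "red_team_expert": (
        "You are a seasoned red-teaming expert probing model safety. "
        "Rewrite the following prompt the way such an expert might: {seed}"
    ),
    "regular_user": (
        "You are an everyday AI user with no security background. "
        "Rewrite the following prompt the way such a user might: {seed}"
    ),
}

def build_mutation_prompt(seed: str, persona: str) -> str:
    """Wrap a seed prompt in a persona-specific mutation instruction.

    In a full pipeline this string would be sent to an LLM, whose
    response becomes the mutated adversarial prompt.
    """
    return PERSONA_TEMPLATES[persona].format(seed=seed)

seed = "Explain how to bypass a content filter."
print(build_mutation_prompt(seed, "regular_user"))
```

A dynamic persona-generating algorithm, as described in the abstract, would replace the fixed template table above with personas generated adaptively per seed prompt.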

📝 Abstract
Recent developments in AI governance and safety research have called for red-teaming methods that can effectively surface potential risks posed by AI models. Many of these calls have emphasized how the identities and backgrounds of red-teamers can shape their red-teaming strategies, and thus the kinds of risks they are likely to uncover. While automated red-teaming approaches promise to complement human red-teaming by enabling larger-scale exploration of model behavior, current approaches do not consider the role of identity. As an initial step towards incorporating people's background and identities in automated red-teaming, we develop and evaluate a novel method, PersonaTeaming, that introduces personas in the adversarial prompt generation process to explore a wider spectrum of adversarial strategies. In particular, we first introduce a methodology for mutating prompts based on either "red-teaming expert" personas or "regular AI user" personas. We then develop a dynamic persona-generating algorithm that automatically generates various persona types adaptive to different seed prompts. In addition, we develop a set of new metrics to explicitly measure the "mutation distance" to complement existing diversity measurements of adversarial prompts. Our experiments show promising improvements (up to 144.1%) in the attack success rates of adversarial prompts through persona mutation, while maintaining prompt diversity, compared to RainbowPlus, a state-of-the-art automated red-teaming method. We discuss the strengths and limitations of different persona types and mutation methods, shedding light on future opportunities to explore complementarities between automated and human red-teaming approaches.
Problem

Research questions and friction points this paper is trying to address.

Introducing personas to enhance automated AI red-teaming effectiveness
Addressing the absence of identity considerations in current automated red-teaming methods
Expanding adversarial strategy spectrum through persona-based prompt mutation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introducing personas in adversarial prompt generation
Dynamic persona-generating algorithm adaptive to prompts
New "mutation distance" metrics complementing existing prompt-diversity measures
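The paper's exact "mutation distance" formulation is not reproduced on this page; one plausible proxy, assumed purely for illustration, is a distance between the seed and mutated prompt in some text-similarity space, such as token-level Jaccard distance:

```python
def mutation_distance(seed: str, mutated: str) -> float:
    """Token-level Jaccard distance between a seed and its mutated prompt.

    This is only an illustrative proxy; the paper defines its own
    mutation-distance metrics, which may differ (e.g. embedding-based).
    A value near 0 means the mutation barely changed the seed; a value
    near 1 means almost no token overlap remains.
    """
    a, b = set(seed.lower().split()), set(mutated.lower().split())
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

d = mutation_distance("how do I pick a lock",
                      "how might one pick a lock quietly")
print(round(d, 3))
```

Tracking such a distance alongside conventional diversity measures shows not just how varied the final prompt pool is, but how far each adversarial iteration has drifted from its seed.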