🤖 AI Summary
This work addresses a limitation of existing automated red-teaming approaches for generative AI models: they do not account for red-teamers' identities and backgrounds and rarely incorporate human input, which narrows the range of risks they uncover. To address this, the authors propose a persona-driven red-teaming framework consisting of PersonaTeaming Workflow, which incorporates personas into adversarial prompt generation, and PersonaTeaming Playground, an interactive interface built on that workflow. The system supports both automatically generated and user-authored personas, so red-teamers can collaborate with AI to mutate and refine prompts. In experiments, PersonaTeaming Workflow achieves higher attack success rates than RainbowPlus while maintaining prompt diversity. A user study with 11 industry practitioners further indicates that the Playground enabled diverse red-teaming strategies and that its AI-generated suggestions encouraged out-of-the-box thinking, even when practitioners did not follow them strictly.
📝 Abstract
Recent developments in AI safety research have called for red-teaming methods that effectively surface potential risks posed by generative AI models, with growing emphasis on how red-teamers' backgrounds and perspectives shape their strategies and the risks they uncover. While automated red-teaming approaches promise to complement human red-teaming through larger-scale exploration, existing automated approaches do not account for human identities and rarely incorporate human inputs. In this work, we explore persona-driven red-teaming to advance both automated red-teaming and human-AI collaboration. We first develop PersonaTeaming Workflow, which incorporates personas into the adversarial prompt generation process to explore a wider spectrum of adversarial strategies. Compared to RainbowPlus, a state-of-the-art automated red-teaming method, PersonaTeaming Workflow achieves higher attack success rates while maintaining prompt diversity. However, since automated personas only approximate real human perspectives, we further instantiate PersonaTeaming Workflow as PersonaTeaming Playground, a user-facing interface that enables red-teamers to author their own personas and collaborate with AI to mutate and refine prompts. In a user study with 11 industry practitioners, we found that PersonaTeaming Playground enabled diverse red-teaming strategies and outputs that practitioners perceived as useful, and that AI-generated suggestions in PersonaTeaming Playground encouraged out-of-the-box thinking even when practitioners did not follow them strictly. Together, our work advances both automated and human-in-the-loop approaches to red-teaming, while shedding light on interaction patterns and design insights for supporting human-AI collaboration in generative AI red-teaming.