Quality-Diversity Red-Teaming: Automated Generation of High-Quality and Diverse Attackers for Large Language Models

๐Ÿ“… 2025-06-08
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Existing LLM safety evaluations suffer from insufficient diversity in adversarial prompt generation and incomplete coverage of safety risks. Method: This paper proposes an automated red-teaming framework featuring a novel behavior-conditioned training paradigm and an open-ended behavioral replay buffer. It establishes a multi-expert collaborative attack architecture integrating behavioral modeling, multi-task specialized training, embedding-space disentangled sampling, and dynamic buffer replayโ€”enabling goal-directed diversity in attack styles and safety risk categories. Contribution/Results: Evaluated on GPT-2, Llama-3, Gemma-2, and Qwen2.5, the framework achieves a 27% improvement in attack effectiveness, 3.1ร— higher strategy diversity than baseline methods, and comprehensive coverage of nine safety risk categories. It significantly enhances the systematicity and robustness of LLM safety evaluation.

Technology Category

Application Category

๐Ÿ“ Abstract
Ensuring safety of large language models (LLMs) is important. Red teaming--a systematic approach to identifying adversarial prompts that elicit harmful responses from target LLMs--has emerged as a crucial safety evaluation method. Within this framework, the diversity of adversarial prompts is essential for comprehensive safety assessments. We find that previous approaches to red-teaming may suffer from two key limitations. First, they often pursue diversity through simplistic metrics like word frequency or sentence embedding similarity, which may not capture meaningful variation in attack strategies. Second, the common practice of training a single attacker model restricts coverage across potential attack styles and risk categories. This paper introduces Quality-Diversity Red-Teaming (QDRT), a new framework designed to address these limitations. QDRT achieves goal-driven diversity through behavior-conditioned training and implements a behavioral replay buffer in an open-ended manner. Additionally, it trains multiple specialized attackers capable of generating high-quality attacks across diverse styles and risk categories. Our empirical evaluation demonstrates that QDRT generates attacks that are both more diverse and more effective against a wide range of target LLMs, including GPT-2, Llama-3, Gemma-2, and Qwen2.5. This work advances the field of LLM safety by providing a systematic and effective approach to automated red-teaming, ultimately supporting the responsible deployment of LLMs.
Problem

Research questions and friction points this paper is trying to address.

Ensuring safety of large language models through diverse adversarial prompts
Overcoming simplistic diversity metrics in red-teaming approaches
Generating specialized attackers for varied attack styles and risks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Behavior-conditioned training for goal-driven diversity
Open-ended behavioral replay buffer implementation
Multiple specialized attackers for diverse risk categories
๐Ÿ”Ž Similar Papers
No similar papers found.