🤖 AI Summary
Existing LLM red-teaming methods suffer from limited attack diversity, poor scalability, and low efficiency. To address these limitations, this paper proposes an evolutionary Quality-Diversity (QD) adversarial prompt generation framework. It introduces a novel multi-element archive and a concurrent multi-objective fitness evaluation mechanism, departing from the single-prompt archives and pairwise comparisons of prior QD methods, and integrates multidimensional prompt quality assessment into an improved MAP-Elites algorithm with a parallelized generation-and-filtering architecture. Evaluated across six benchmark datasets and four open-source LLMs, the framework achieves a Diverse-Score of 0.84 and generates up to two orders of magnitude more unique prompts than prior QD methods. On HarmBench, it attains a mean attack success rate of 81.1%, surpassing the state-of-the-art AutoDAN-Turbo by 3.9 percentage points while requiring only 1.45 hours of computation, roughly nine times faster.
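The key departure from single-prompt archives is easiest to see in code. Below is a minimal, hypothetical sketch in Python of a MAP-Elites archive whose cells each hold several prompts; the class name `MultiElementArchive`, its methods, and the capacity parameter are illustrative assumptions, not the paper's actual API.

```python
from collections import defaultdict

class MultiElementArchive:
    """Hypothetical sketch: a MAP-Elites-style archive whose cells
    store multiple prompts instead of a single elite per cell."""

    def __init__(self, capacity_per_cell=10):
        self.capacity = capacity_per_cell
        # Maps a behavioral descriptor (e.g., risk category, attack
        # style) to a list of (prompt, fitness) pairs.
        self.cells = defaultdict(list)

    def add(self, descriptor, prompt, fitness):
        """Insert a prompt and keep only the top-k by fitness per cell."""
        cell = self.cells[descriptor]
        cell.append((prompt, fitness))
        cell.sort(key=lambda pair: pair[1], reverse=True)
        del cell[self.capacity:]  # evict everything beyond capacity

    def unique_prompts(self):
        """All prompts currently retained across all cells."""
        return [p for cell in self.cells.values() for p, _ in cell]
```

Keeping several elites per cell is what lets the search retain many distinct high-quality prompts per behavioral niche rather than overwriting them, which is where the reported gain in unique-prompt count comes from.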
📝 Abstract
Large Language Models (LLMs) exhibit remarkable capabilities but are susceptible to adversarial prompts that exploit vulnerabilities to produce unsafe or biased outputs. Existing red-teaming methods often face scalability challenges, resource-intensive requirements, or limited diversity in attack strategies. We propose RainbowPlus, a novel red-teaming framework rooted in evolutionary computation, enhancing adversarial prompt generation through an adaptive quality-diversity (QD) search that extends classical evolutionary algorithms like MAP-Elites with innovations tailored to language models. By employing a multi-element archive to store diverse high-quality prompts and a comprehensive fitness function to evaluate multiple prompts concurrently, RainbowPlus overcomes the constraints of single-prompt archives and pairwise comparisons in prior QD methods such as Rainbow Teaming. Experiments comparing RainbowPlus to QD methods across six benchmark datasets and four open-source LLMs demonstrate superior attack success rate (ASR) and diversity (Diverse-Score $\approx 0.84$), generating up to 100 times more unique prompts (e.g., 10,418 vs. 100 for Ministral-8B-Instruct-2410). Against nine state-of-the-art methods on the HarmBench dataset with twelve LLMs (ten open-source, two closed-source), RainbowPlus achieves an average ASR of 81.1%, surpassing AutoDAN-Turbo by 3.9 percentage points, and is 9 times faster (1.45 vs. 13.50 hours). Our open-source implementation fosters further advancements in LLM safety, offering a scalable tool for vulnerability assessment. Code and resources are publicly available at https://github.com/knoveleng/rainbowplus, supporting reproducibility and future research in LLM red-teaming.
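As a companion sketch, the adaptive QD search described above might be organized as follows. This reuses the `MultiElementArchive` sketch given earlier; `descriptor_fn`, `mutate_prompts`, and `score_batch` are hypothetical stand-ins for the behavioral-descriptor mapping, an LLM-driven prompt mutator, and the concurrent multi-prompt fitness evaluator, and are assumptions rather than the released implementation.

```python
import random

def quality_diversity_search(archive, seed_prompts, descriptor_fn,
                             mutate_prompts, score_batch, iterations=100):
    """Hypothetical QD loop: sample parents from the archive, mutate
    them in batches, score all candidates concurrently, and insert
    survivors back into the archive."""
    # Seed the archive so the first iteration has parents to sample.
    for prompt in seed_prompts:
        archive.add(descriptor_fn(prompt), prompt, fitness=0.0)

    for _ in range(iterations):
        pool = archive.unique_prompts()
        parents = random.sample(pool, k=min(8, len(pool)))
        candidates = mutate_prompts(parents)   # e.g., LLM-driven mutation
        scores = score_batch(candidates)       # batched fitness, not pairwise
        for prompt, fitness in zip(candidates, scores):
            archive.add(descriptor_fn(prompt), prompt, fitness)
    return archive
```

Scoring candidates in batches rather than through pairwise comparisons is what the abstract credits for the efficiency gain, since each evaluation call amortizes over many prompts at once.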