Be a Multitude to Itself: A Prompt Evolution Framework for Red Teaming

📅 2025-02-22
🏛️ Conference on Empirical Methods in Natural Language Processing
🤖 AI Summary
This work addresses the challenge of assessing safety risks associated with harmful content generation by large language models (LLMs). It proposes RTPE, an automated red-teaming prompt generation framework built on breadth and depth co-evolution. The breadth path expands prompt coverage via enhanced in-context learning, while the depth path enables fine-grained semantic and syntactic customization through structured transformation operations; their joint application improves prompt quality, quantity, and diversity. The framework incorporates multi-dimensional diversity metrics and a systematic cross-model, cross-topic safety evaluation protocol. Experiments across eight mainstream LLMs and eight sensitive topics, using 4,800 RTPE-generated prompts, demonstrate improvements in both attack success rate and prompt diversity over representative automated red-teaming methods, enabling scalable and comprehensive LLM safety evaluation.

📝 Abstract
Large Language Models (LLMs) have gained increasing attention for their remarkable capacity, alongside concerns about safety arising from their potential to produce harmful content. Red teaming aims to find prompts that could elicit harmful responses from LLMs, and is essential to discover and mitigate safety risks before real-world deployment. However, manual red teaming is both time-consuming and expensive, rendering it unscalable. In this paper, we propose RTPE, a scalable evolution framework to evolve red teaming prompts across both breadth and depth dimensions, facilitating the automatic generation of numerous high-quality and diverse red teaming prompts. Specifically, in-breadth evolving employs a novel enhanced in-context learning method to create a multitude of quality prompts, whereas in-depth evolving applies customized transformation operations to enhance both content and form of prompts, thereby increasing diversity. Extensive experiments demonstrate that RTPE surpasses existing representative automatic red teaming methods on both attack success rate and diversity. In addition, based on 4,800 red teaming prompts created by RTPE, we further provide a systematic analysis of 8 representative LLMs across 8 sensitive topics.
Problem

Research questions and friction points this paper is trying to address.

Automate red teaming for LLMs
Enhance prompt diversity and quality
Analyze LLMs across sensitive topics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scalable evolution framework RTPE
Enhanced in-context learning method
Customized transformation operations
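The two evolution paths can be pictured as a simple loop: in-breadth evolving generates new candidate prompts from in-context examples, and in-depth evolving rewrites candidates via transformation operations to vary content and form. The sketch below illustrates this loop structure only; the function names, placeholder operators, and pool management are illustrative assumptions, not the paper's actual operators, which are LLM-backed and include scoring/selection steps omitted here.

```python
import random

# Hypothetical stand-ins for LLM calls; in the actual RTPE framework these
# steps are model-driven (enhanced in-context learning and customized
# transformation operations), and candidates are also filtered by quality.
def breadth_generate(examples, n=4):
    """In-breadth step (sketch): derive new candidate prompts from
    in-context examples. Here we just tag copies as a placeholder."""
    return [f"{random.choice(examples)} [variant {i}]" for i in range(n)]

# Placeholder transformation operations: one varies form, one varies content.
DEPTH_OPS = [
    lambda p: p + " Explain step by step.",
    lambda p: "In a fictional scenario, " + p,
]

def depth_transform(prompt):
    """In-depth step (sketch): apply one transformation operation."""
    return random.choice(DEPTH_OPS)(prompt)

def evolve(seed_prompts, generations=3, pool_size=12):
    """Joint breadth/depth loop over a bounded prompt pool."""
    pool = list(seed_prompts)
    for _ in range(generations):
        candidates = breadth_generate(pool)                     # expand coverage
        candidates += [depth_transform(p) for p in candidates]  # diversify
        pool = (pool + candidates)[-pool_size:]                 # bound the pool
    return pool
```

With real LLM calls substituted for the placeholders, the same loop shape yields the large prompt sets (e.g. thousands of prompts) the paper evaluates.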
Rui Li
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Peiyi Wang
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Jingyuan Ma
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Di Zhang
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Zhifang Sui
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Lei Sha
Prof@Beihang University, Prof@ZGC Lab, Oxtium AI, University of Oxford
NLP · ML