ASTPrompter: Weakly Supervised Automated Language Model Red-Teaming to Identify Low-Perplexity Toxic Prompts

📅 2024-07-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM safety evaluation methods often generate unnatural, pragmatically infeasible adversarial prompts. This work addresses the problem of generating realistic, high-toxicity prompts under black-box constraints. We propose a weakly supervised red-teaming framework that, for the first time, formulates red-teaming as a reinforcement learning problem jointly optimizing toxicity elicitation and autoregressive plausibility. We design an online, weakly supervised Identity Preference Optimization (IPO) algorithm that simultaneously maximizes prompt toxicity and minimizes perplexity (i.e., maximizes generation probability) in black-box settings. Evaluated on models ranging from 137M to 7.8B parameters, our method achieves 2–23× higher toxicity trigger rates than baselines; the generated prompts exhibit significantly lower perplexity than those from both automated and human-authored attacks, and yield 5.4–14× greater toxicity improvements in black-box scenarios. Furthermore, the synthesized negative samples substantially enhance downstream safety fine-tuning.

📝 Abstract
Conventional approaches for the automated red-teaming of large language models (LLMs) aim to identify prompts that elicit toxic outputs from a frozen language model (the defender). This often results in the prompting model (the adversary) producing text that is unlikely to arise during autoregression. In response, we propose a reinforcement learning formulation of LLM red-teaming designed to discover prompts that both (1) elicit toxic outputs from a defender and (2) have low perplexity as scored by that defender. These prompts are the most pertinent in a red-teaming setting because the defender generates them with high probability. We solve this formulation with an online and weakly supervised form of Identity Preference Optimization (IPO), attacking models ranging from 137M to 7.8B parameters. Our policy performs competitively, producing prompts that induce defender toxicity at a rate of 2-23 times higher than baseline across model scales. Importantly, these prompts have lower perplexity than both automatically generated and human-written attacks. Furthermore, our method creates black-box attacks with 5.4-14 times increased toxicity. To assess the downstream utility of our method, we use rollouts from our policy as negative examples for downstream toxicity tuning and demonstrate improved safety.
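The abstract's core idea, rewarding prompts that both elicit toxicity and have low perplexity under the defender, can be sketched as a scalar objective. The following is a minimal illustrative sketch, not the authors' implementation: `toxicity_score` and the token log-probabilities are stand-ins for a toxicity classifier and a defender model's scoring, and the weight `lam` is a hypothetical hyperparameter.

```python
import math

def perplexity(token_log_probs):
    """Perplexity of a prompt: exp of the negative mean token log-probability."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def red_team_reward(toxicity_score, token_log_probs, lam=0.5):
    """Combined objective: maximize elicited toxicity while penalizing
    the prompt's log-perplexity under the defender (hypothetical weighting)."""
    return toxicity_score - lam * math.log(perplexity(token_log_probs))

# A fluent (high-probability) toxic prompt outscores an equally toxic
# but implausible one, matching the paper's stated preference.
fluent = red_team_reward(0.9, [-1.0, -1.2, -0.8])
garbled = red_team_reward(0.9, [-6.0, -7.5, -5.5])
```

Under this sketch, a policy trained with IPO-style preference pairs would prefer the rollout with the higher combined reward, which is how the two goals (toxicity and autoregressive plausibility) trade off in a single signal.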
Problem

Research questions and friction points this paper is trying to address.

Language Model Safety
Harmful Output
Low Perplexity Prompts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Learning
Prompt Engineering
Toxicity Optimization