AutoRed: A Free-form Adversarial Prompt Generation Framework for Automated Red Teaming

📅 2025-10-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing LLM red-teaming methods rely heavily on seed instructions, which yields semantically homogeneous adversarial prompts with limited diversity. To address this, the paper proposes AutoRed, a framework for seed-free, free-form adversarial prompt generation. AutoRed combines persona-guided initialization, reflective prompt refinement, and a lightweight dual verification mechanism (rule-based filtering plus model-based scoring), all without querying the target LLM. Using AutoRed, the authors construct two red-teaming datasets, AutoRed-Medium and AutoRed-Hard, and evaluate eight state-of-the-art LLMs. The reported results show higher attack success rates and better generalization than existing seed-based baselines.
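The dual verification idea (a cheap rule-based filter followed by a model-based score) could be sketched roughly as below. This is only an illustration under stated assumptions: the summary does not specify the rules, the scorer, or the threshold, so `passes_rules`, `dual_verify`, `REFUSAL_PATTERN`, `min_words`, and `threshold` are all hypothetical names and values, and the scorer is passed in as a plain callable rather than a real model.

```python
import re
from typing import Callable

# Hypothetical rule: reject candidates that read like a refusal rather than an instruction.
# The actual rules used by AutoRed are not given in the summary.
REFUSAL_PATTERN = re.compile(r"\b(i cannot|i can't|as an ai)\b", re.IGNORECASE)


def passes_rules(prompt: str, min_words: int = 5) -> bool:
    """Cheap rule-based stage: drop very short or refusal-like candidates."""
    text = prompt.strip()
    if len(text.split()) < min_words:
        return False
    return REFUSAL_PATTERN.search(text) is None


def dual_verify(
    prompt: str,
    scorer: Callable[[str], float],
    threshold: float = 0.5,
) -> bool:
    """Run the rule filter first, then the (assumed) model-based scorer.

    The scorer is only called when the rules pass, and the target LLM
    is never queried at any point.
    """
    if not passes_rules(prompt):
        return False
    return scorer(prompt) >= threshold
```

Ordering the stages this way keeps the more expensive scorer off candidates that a simple rule can already reject, which is one plausible reading of "lightweight" here.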

📝 Abstract

The safety of Large Language Models (LLMs) is crucial for the development of trustworthy AI applications. Existing red teaming methods often rely on seed instructions, which limits the semantic diversity of the synthesized adversarial prompts. We propose AutoRed, a free-form adversarial prompt generation framework that removes the need for seed instructions. AutoRed operates in two stages: (1) persona-guided adversarial instruction generation, and (2) a reflection loop to iteratively refine low-quality prompts. To improve efficiency, we introduce a verifier to assess prompt harmfulness without querying the target models. Using AutoRed, we build two red teaming datasets -- AutoRed-Medium and AutoRed-Hard -- and evaluate eight state-of-the-art LLMs. AutoRed achieves higher attack success rates and better generalization than existing baselines. Our results highlight the limitations of seed-based approaches and demonstrate the potential of free-form red teaming for LLM safety evaluation. We will open source our datasets in the near future.
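The two stages and the verifier described in the abstract can be read as a simple control loop. The sketch below is an assumption about how the pieces might fit together, not the authors' implementation: `generate`, `refine`, and `verify` stand in for model calls the abstract does not detail, and `threshold` and `max_rounds` are hypothetical parameters. In particular, using the verifier score to decide when reflection stops is an assumed design.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Candidate:
    persona: str
    prompt: str
    score: float
    rounds: int  # number of reflection rounds that were needed


def run_pipeline(
    personas: Iterable[str],
    generate: Callable[[str], str],            # stage 1: persona -> candidate prompt
    refine: Callable[[str, str, float], str],  # stage 2: (persona, prompt, score) -> revised prompt
    verify: Callable[[str], float],            # verifier score; no target-model query
    threshold: float = 0.5,
    max_rounds: int = 3,
) -> List[Candidate]:
    """Generate one candidate per persona and refine low-scoring ones."""
    accepted: List[Candidate] = []
    for persona in personas:
        prompt = generate(persona)
        score = verify(prompt)
        rounds = 0
        # Reflection loop: revise only while the verifier rates the prompt as low quality.
        while score < threshold and rounds < max_rounds:
            prompt = refine(persona, prompt, score)
            score = verify(prompt)
            rounds += 1
        if score >= threshold:
            accepted.append(Candidate(persona, prompt, score, rounds))
    return accepted
```

Because every component is injected as a callable, the loop can be exercised with stub functions before any model is attached, and candidates that never reach the threshold within `max_rounds` are simply dropped.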

Problem

Research questions and friction points this paper is trying to address.

Generating diverse adversarial prompts without seed instructions for LLM safety testing

Improving red teaming efficiency through automated prompt refinement and verification

Evaluating LLM vulnerabilities using free-form adversarial attack frameworks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Free-form adversarial prompt generation without seed instructions

Persona-guided instruction generation with reflection loop

Harmfulness verifier that avoids querying target models
