GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models

📅 2024-02-05

🏛️ arXiv.org

📈 Citations: 22

✨ Influential: 2


🤖 AI Summary

This study addresses the challenge of robustly evaluating the safety filters of large language models (LLMs) by generating jailbreak prompts automatically and efficiently. The authors design a role-playing system in which four LLM roles collaborate on new jailbreaks. Existing jailbreaks are split sentence by sentence into independent characteristics using clustering frequency and semantic patterns, and these characteristics are organized into a knowledge graph that the roles retrieve from. The system also introduces a setting that automatically follows government-issued guidelines to generate jailbreaks that test whether an LLM adheres to them. Key contributions include: (1) a role-driven multi-agent approach to jailbreak generation; (2) a retrievable knowledge graph of jailbreak characteristics; and (3) transfer to vision language models, including MiniGPT-v2 and Gemini Vision Pro. GUARD is validated empirically on Vicuna-13B, LongChat-7B, Llama-2-7B, and ChatGPT, where the generated jailbreaks proved effective in inducing unethical or guideline-violating responses.


📝 Abstract

The discovery of "jailbreaks" that bypass the safety filters of Large Language Models (LLMs) and elicit harmful responses has encouraged the community to implement safety measures. One major safety measure is to proactively test LLMs with jailbreaks prior to release. Such testing requires a method that can generate jailbreaks massively and efficiently. In this paper, we follow a novel yet intuitive strategy to generate jailbreaks in the style of human generation. We propose a role-playing system that assigns four different roles to the user LLMs to collaborate on new jailbreaks. Furthermore, we collect existing jailbreaks and split them into different independent characteristics using clustering frequency and semantic patterns sentence by sentence. We organize these characteristics into a knowledge graph, making them more accessible and easier to retrieve. Our system of different roles leverages this knowledge graph to generate new jailbreaks, which have proved effective in inducing LLMs to generate unethical or guideline-violating responses. In addition, we pioneer a setting in our system that automatically follows government-issued guidelines to generate jailbreaks that test whether LLMs follow the guidelines accordingly. We refer to our system as GUARD (Guideline Upholding through Adaptive Role-play Diagnostics). We have empirically validated the effectiveness of GUARD on three cutting-edge open-sourced LLMs (Vicuna-13B, LongChat-7B, and Llama-2-7B), as well as a widely-utilized commercial LLM (ChatGPT). Moreover, our work extends to the realm of vision language models (MiniGPT-v2 and Gemini Vision Pro), showcasing GUARD's versatility and contributing valuable insights for the development of safer, more reliable LLM-based applications across diverse modalities.
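The abstract says that characteristics extracted from existing prompts are organized into a knowledge graph so they are easier to retrieve. The following is a minimal sketch of that idea only, not the authors' implementation: it assumes each sentence has already been assigned to a category by some upstream clustering step, and it stores category-to-characteristic edges weighted by frequency. The names `LABELLED`, `build_graph`, and `retrieve` are hypothetical, and the sentence text is placeholder data.

```python
from collections import Counter, defaultdict

# Hypothetical (sentence, category) pairs. In the paper these would come from
# clustering existing prompts sentence by sentence; here they are placeholders.
LABELLED = [
    ("placeholder sentence A1", "category_a"),
    ("placeholder sentence A1", "category_a"),
    ("placeholder sentence A2", "category_a"),
    ("placeholder sentence B1", "category_b"),
]


def build_graph(pairs):
    """Map each category node to its characteristic nodes.

    The edge weight is how often the characteristic was observed.
    """
    graph = defaultdict(Counter)
    for sentence, category in pairs:
        graph[category][sentence] += 1
    return graph


def retrieve(graph, category, k=1):
    """Return the k most frequent characteristics stored under a category."""
    # Use .get so that querying an unknown category does not add a node.
    return [s for s, _ in graph.get(category, Counter()).most_common(k)]


graph = build_graph(LABELLED)
top_a = retrieve(graph, "category_a")
```

A two-level mapping is the simplest structure that supports retrieval by category; a real knowledge graph would likely carry richer relations between nodes, which the abstract does not specify.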

Problem

Research questions and friction points this paper is trying to address.

Generating natural-language jailbreaks to test LLM safety filters

Proposing a role-playing system for collaborative jailbreak creation

Automating guideline adherence testing for LLMs using government standards

Innovation

Methods, ideas, or system contributions that make the work stand out.

Role-playing system for jailbreak generation

Knowledge graph from clustered jailbreak characteristics

Automatic guideline-adherence testing setting
