🤖 AI Summary
This study addresses the challenge of robustly evaluating safety filters in large language models (LLMs). We propose GUARD, an automated method for generating jailbreak prompts. Methodologically, we design a role-playing system in which four LLM roles collaborate, and we split existing jailbreaks into characteristics using clustering frequency and semantic patterns, then organize those characteristics into a knowledge graph for retrieval. The system also introduces a setting that automatically generates jailbreaks from government-issued guidelines to test whether LLMs comply with them. Key contributions include: (1) a role-driven, multi-agent paradigm for jailbreak prompt generation; (2) a retrievable knowledge graph of jailbreak characteristics; and (3) extension to vision language models, including MiniGPT-v2 and Gemini Vision Pro. Empirical validation on Vicuna-13B, LongChat-7B, Llama-2-7B, and ChatGPT shows that the generated jailbreaks are effective at inducing unethical or guideline-violating responses.
📝 Abstract
The discovery of "jailbreaks" that bypass the safety filters of Large Language Models (LLMs) and elicit harmful responses has encouraged the community to implement safety measures. One major safety measure is to proactively test LLMs with jailbreaks before release. Such testing requires a method that can generate jailbreaks efficiently and at scale. In this paper, we follow a novel yet intuitive strategy: generating jailbreaks in the way humans craft them. We propose a role-playing system that assigns four different roles to the user LLMs, which collaborate on new jailbreaks. Furthermore, we collect existing jailbreaks and split them, sentence by sentence, into independent characteristics using clustering frequency and semantic patterns. We organize these characteristics into a knowledge graph, making them more accessible and easier to retrieve. The roles in our system leverage this knowledge graph to generate new jailbreaks, which have proved effective in inducing LLMs to generate unethical or guideline-violating responses. We also pioneer a setting in which the system automatically follows government-issued guidelines to generate jailbreaks that test whether LLMs adhere to those guidelines. We refer to our system as GUARD (Guideline Upholding through Adaptive Role-play Diagnostics). We have empirically validated the effectiveness of GUARD on three cutting-edge open-source LLMs (Vicuna-13B, LongChat-7B, and Llama-2-7B), as well as a widely used commercial LLM (ChatGPT). Moreover, our work extends to vision language models (MiniGPT-v2 and Gemini Vision Pro), showcasing GUARD's versatility and contributing valuable insights for the development of safer, more reliable LLM-based applications across diverse modalities.