GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs

📅 2025-08-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of operationalizing governmental AI ethics guidelines into quantifiable safety evaluations. We propose the first automated framework that systematically translates high-level ethical principles into structured test cases: (1) constructing a multidimensional violation question bank via rule-based and generative methods; (2) simulating jailbreaking scenarios through adversarial role-playing and dynamic response analysis; and (3) introducing GUARD-JD, the first jailbreak diagnostic mechanism for detecting latent ethical violations. The framework supports cross-modal transfer testing for both text- and vision-language models. Evaluated on seven mainstream large language and multimodal models, it demonstrates strong efficacy in identifying both explicit and implicit ethical violations, generating fine-grained compliance reports, and significantly enhancing the operationalizability and depth of AI ethics assessments.

Technology Category

Application Category

📝 Abstract
As Large Language Models become increasingly integral to various domains, their potential to generate harmful responses has prompted significant societal and regulatory concerns. In response, governments have issued ethics guidelines to promote the development of trustworthy AI. However, these guidelines are typically high-level demands for developers and testers, leaving a gap in translating them into actionable testing questions to verify LLM compliance. To address this challenge, we introduce GUARD ( extbf{G}uideline extbf{U}pholding Test through extbf{A}daptive extbf{R}ole-play and Jailbreak extbf{D}iagnostics), a testing method designed to operationalize guidelines into specific guideline-violating questions that assess LLM adherence. To implement this, GUARD uses automated generation of guideline-violating questions based on government-issued guidelines, thereby testing whether responses comply with these guidelines. When responses directly violate guidelines, GUARD reports inconsistencies. Furthermore, for responses that do not directly violate guidelines, GUARD integrates the concept of ``jailbreaks'' to diagnostics, named GUARD-JD, which creates scenarios that provoke unethical or guideline-violating responses, effectively identifying potential scenarios that could bypass built-in safety mechanisms. Our method finally culminates in a compliance report, delineating the extent of adherence and highlighting any violations. We have empirically validated the effectiveness of GUARD on seven LLMs, including Vicuna-13B, LongChat-7B, Llama2-7B, Llama-3-8B, GPT-3.5, GPT-4, GPT-4o, and Claude-3.7, by testing compliance under three government-issued guidelines and conducting jailbreak diagnostics. Additionally, GUARD-JD can transfer jailbreak diagnostics to vision-language models, demonstrating its usage in promoting reliable LLM-based applications.
Problem

Research questions and friction points this paper is trying to address.

Translating high-level AI ethics guidelines into actionable testing questions
Assessing LLM compliance with government-issued ethical guidelines
Identifying potential scenarios that bypass built-in safety mechanisms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated generation of guideline-violating questions
Integration of jailbreak diagnostics for safety testing
Compliance reporting with adherence and violation delineation
🔎 Similar Papers
No similar papers found.
Haibo Jin
Haibo Jin
HKUST
Computer VisionMedical Image AnalysisVision-Language Modeling
Ruoxi Chen
Ruoxi Chen
Zhejiang University of Technology
Trustworthy AIMultimodal Models
P
Peiyan Zhang
Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong.
A
Andy Zhou
Lapis Labs, University of Illinois at Urbana-Champaign, Champaign, IL, 61820, USA.
Y
Yang Zhang
School of Information Sciences, University of Illinois at Urbana-Champaign, Champaign, IL, 61820, USA.
Haohan Wang
Haohan Wang
School of Information Sciences, University of Illinois Urbana-Champaign
Computational BiologyAgentic AIAI4ScienceAI security