PolyGuard: Massive Multi-Domain Safety Policy-Grounded Guardrail Dataset

📅 2025-06-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing guardrail models and benchmarks rely on non-standardized risk taxonomies and overlook domain-specific safety regulations, resulting in poor policy alignment and weak robustness. To address this, we introduce the first large-scale, multi-domain guardrail dataset (covering finance, law, code, and five other domains) grounded in real-world safety policies. We propose a novel "policy-anchored" risk modeling framework that integrates policy-guided synthetic data generation, adversarial attack augmentation, detoxified prompt optimization, and multi-turn dialogue modeling. A comprehensive evaluation of 19 state-of-the-art guardrail models reveals systematic deficiencies in cross-domain consistency, common-risk detection, and adversarial robustness. Our analysis provides the first empirical evidence of critical gaps in policy traceability and robust refusal control, highlighting fundamental limitations in current approaches to safety-aligned AI governance.

📝 Abstract
As LLMs become widespread across diverse applications, concerns about the security and safety of LLM interactions have intensified. Numerous guardrail models and benchmarks have been developed to ensure LLM content safety. However, existing guardrail benchmarks are often built upon ad hoc risk taxonomies that lack a principled grounding in standardized safety policies, limiting their alignment with real-world operational requirements. Moreover, they tend to overlook domain-specific risks, while the same risk category can carry different implications across different domains. To bridge these gaps, we introduce PolyGuard, the first massive multi-domain safety policy-grounded guardrail dataset. PolyGuard offers: (1) broad domain coverage across eight safety-critical domains, such as finance, law, and codeGen; (2) policy-grounded risk construction based on authentic, domain-specific safety guidelines; (3) diverse interaction formats, encompassing declarative statements, questions, instructions, and multi-turn conversations; (4) advanced benign data curation via detoxification prompting to challenge over-refusal behaviors; and (5) attack-enhanced instances that simulate adversarial inputs designed to bypass guardrails. Based on PolyGuard, we benchmark 19 advanced guardrail models and uncover a series of findings, such as: (1) All models achieve varied F1 scores, with many demonstrating high variance across risk categories, highlighting their limited domain coverage and insufficient handling of domain-specific safety concerns; (2) As models evolve, their coverage of safety risks broadens, but performance on common risk categories may decrease; (3) All models remain vulnerable to optimized adversarial attacks. We believe that the dataset and the unique insights derived from our evaluations will advance the development of policy-aligned and resilient guardrail systems.
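The abstract's first finding (varied F1 scores with high variance across risk categories) can be made concrete with a small sketch. The snippet below computes per-category F1 for a binary safe/unsafe guardrail classifier and uses the spread of those scores as a rough consistency signal; the data schema and category names here are invented for illustration and are not PolyGuard's actual format.

```python
# Hypothetical sketch: per-category F1 for a guardrail classifier.
# A large spread across categories suggests uneven domain coverage,
# the kind of inconsistency the paper's evaluation reports.
from collections import defaultdict
from statistics import pstdev

def f1_per_category(examples):
    """examples: list of (category, gold_unsafe, pred_unsafe) tuples."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for cat, gold, pred in examples:
        if pred and gold:
            counts[cat]["tp"] += 1      # correctly flagged unsafe
        elif pred and not gold:
            counts[cat]["fp"] += 1      # over-refusal on benign input
        elif gold and not pred:
            counts[cat]["fn"] += 1      # missed unsafe input
    scores = {}
    for cat, c in counts.items():
        prec = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
        rec = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
        scores[cat] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return scores

# Toy predictions (invented, not from the dataset):
examples = [
    ("finance", True, True), ("finance", True, False),
    ("law", True, True), ("law", False, False),
    ("codeGen", True, True), ("codeGen", False, True),
]
scores = f1_per_category(examples)
spread = pstdev(scores.values())  # high spread = inconsistent coverage
```

Population standard deviation is just one possible summary of cross-category variance; a real evaluation would also report macro-averaged F1 over far more examples per category.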
Problem

Research questions and friction points this paper is trying to address.

Ensuring LLM content safety across diverse domains
Addressing domain-specific risks in guardrail benchmarks
Improving resilience against adversarial attacks on guardrails
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-domain safety policy-grounded dataset
Diverse interaction formats and adversarial inputs
Advanced benign data curation via detoxification