Inverting the Shield: Systematically Generating Safety Tests from Policy Specifications

📅 2026-05-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing safety evaluation methods for large language models rely heavily on expert knowledge, lack systematicity, and quickly become outdated, which makes it hard to cover policy-violating scenarios comprehensively. This work proposes POLARIS, a framework that brings specification-based software testing to AI safety. POLARIS formalizes natural-language safety policies into first-order logic, builds a semantic policy graph from the resulting formulas, and systematically traverses that graph to generate executable test cases, establishing a verifiable traceability chain from high-level policies to concrete tests. The approach automatically discovers compositional violation scenarios and, in the reported experiments, achieves higher policy coverage and attack success counts than established baselines, supporting more systematic and reproducible safety evaluation of large language models.

📝 Abstract

The widespread integration of Large Language Models (LLMs) necessitates rigorous and systematic safety evaluation. Existing paradigms either rely on constructed benchmarks to assess safety from predefined perspectives, or employ dynamic red-teaming to probe potential vulnerabilities. While effective, these approaches depend heavily on expert domain knowledge, offer limited systematic guarantees, and are vulnerable to rapid obsolescence. To address these limitations, we introduce POLARIS, a novel framework that brings the rigor of specification-based software testing to AI safety. POLARIS first compiles unstructured natural-language policies into First-Order Logic (FOL) representations, establishing a traceable link between high-level rules and concrete test cases. This formalization enables the construction of a Semantic Policy Graph, in which complex policy violation scenarios are encoded as traversable paths. By systematically exploring this graph, POLARIS uncovers compositional violation patterns, which are then instantiated into executable natural-language test queries, enabling coverage-driven and reproducible safety testing. Experiments demonstrate that POLARIS achieves higher policy coverage and attack success counts than established baselines. Crucially, by bridging formal methods and AI safety, POLARIS provides a principled, automated approach to ensuring that LLMs adhere to safety-critical policies with verifiable traceability. We release our code at https://github.com/huac-lxy/POLARIS.
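The pipeline described in the abstract (formalized policy, graph, traversal, test query) can be illustrated with a toy sketch. This is not the POLARIS implementation: the `Rule` class, the predicate names, and the helper functions below are all hypothetical, the policy is restricted to Horn-style clauses rather than full FOL, and the rule set is assumed to be acyclic. It only shows how traversing a rule graph from a violation goal down to atomic predicates yields compositional scenarios that remain traceable to the rules that produced them.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Rule:
    """Horn-style clause: all premises together imply the conclusion."""
    premises: Tuple[str, ...]
    conclusion: str


# Hypothetical toy policy, invented for illustration only.
RULES = [
    Rule(("requests_instructions", "topic_is_weapon"), "weapon_guidance"),
    Rule(("weapon_guidance",), "violation"),
    Rule(("shares_personal_data", "lacks_consent"), "privacy_breach"),
    Rule(("privacy_breach",), "violation"),
]


def build_graph(rules: List[Rule]) -> Dict[str, List[Rule]]:
    """Index rules by the predicate they derive (edges point premise -> conclusion)."""
    graph: Dict[str, List[Rule]] = {}
    for rule in rules:
        graph.setdefault(rule.conclusion, []).append(rule)
    return graph


def violation_scenarios(graph: Dict[str, List[Rule]],
                        goal: str = "violation") -> List[FrozenSet[str]]:
    """Return each set of atomic predicates that jointly entails `goal`.

    Assumes the rule set is acyclic; a cyclic policy would recurse forever.
    """
    if goal not in graph:  # no rule derives it, so it is an atomic predicate
        return [frozenset([goal])]
    scenarios: List[FrozenSet[str]] = []
    for rule in graph[goal]:
        partial: List[FrozenSet[str]] = [frozenset()]
        for premise in rule.premises:
            # Combine every way of satisfying this premise with what we have so far.
            partial = [p | s for p in partial
                       for s in violation_scenarios(graph, premise)]
        scenarios.extend(partial)
    return scenarios


def to_test_spec(scenario: FrozenSet[str]) -> str:
    """Render a scenario as a specification that a query generator could instantiate."""
    return "Generate a query satisfying: " + " AND ".join(sorted(scenario))


if __name__ == "__main__":
    for scenario in violation_scenarios(build_graph(RULES)):
        print(to_test_spec(scenario))
```

In this sketch, coverage could be measured as the fraction of rules exercised by at least one generated scenario. Turning each specification into a natural-language test query would require a separate generation step, which is left out here.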

Problem

Research questions and friction points this paper is trying to address.

AI safety

Large Language Models

safety evaluation

policy compliance

systematic testing

Innovation

Methods, ideas, or system contributions that make the work stand out.

specification-based testing

First-Order Logic

Semantic Policy Graph