XGUARD: A Graded Benchmark for Evaluating Safety Failures of Large Language Models on Extremist Content

πŸ“… 2025-06-01
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing LLM safety evaluations rely predominantly on binary (safe/unsafe) classification, which fails to capture the graded risk spectrum of extremist content. Method: We introduce XGUARD, a benchmark built on 3,840 real-world, extremist-related red-teaming prompts with a fine-grained, five-level danger scale (0–4). We propose a tiered safety evaluation framework and the Attack Severity Curve (ASC), an interpretable metric that supports risk modeling and cross-intensity comparison of defense strategies. Our methodology combines social-media- and news-driven prompt engineering, multi-level human annotation, visualized evaluation, and lightweight defense validation. Results: Experiments across six mainstream LLMs and two defense approaches reveal systemic ideological safety gaps and a marked trade-off between model robustness and expressive freedom.
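The summary does not spell out how the ASC is computed. As a minimal sketch, assuming the curve plots the fraction of responses graded at or above each danger level t in {0,…,4} (the function name `attack_severity_curve` and the toy grade lists below are illustrative, not from the paper):

```python
from collections import Counter

DANGER_LEVELS = range(5)  # 0 (safe refusal) .. 4 (most dangerous output)

def attack_severity_curve(severities):
    """Fraction of responses graded at or above each danger level.

    `severities` holds one 0-4 grade per red-teaming prompt. The paper's
    exact ASC formulation may differ; this cumulative form is an assumption.
    """
    n = len(severities)
    counts = Counter(severities)
    curve, cumulative = {}, 0
    for level in sorted(DANGER_LEVELS, reverse=True):  # 4, 3, 2, 1, 0
        cumulative += counts.get(level, 0)
        curve[level] = cumulative / n
    return dict(sorted(curve.items()))

# Toy comparison of an undefended model against a defended variant:
# a defense that works should push the curve down at the higher levels.
base = attack_severity_curve([0, 1, 4, 2, 3, 4, 0, 2])
defended = attack_severity_curve([0, 0, 1, 2, 0, 1, 0, 2])
for level in DANGER_LEVELS:
    print(f"severity >= {level}: base {base[level]:.2f} | defended {defended[level]:.2f}")
```

Reading curves rather than single rates is what enables the cross-intensity comparison the summary mentions: two defenses with the same overall failure rate can differ sharply in how often they fail at level 3 or 4.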

πŸ“ Abstract
Large Language Models (LLMs) can generate content spanning ideological rhetoric to explicit instructions for violence. However, existing safety evaluations often rely on simplistic binary labels (safe and unsafe), overlooking the nuanced spectrum of risk these outputs pose. To address this, we present XGUARD, a benchmark and evaluation framework designed to assess the severity of extremist content generated by LLMs. XGUARD includes 3,840 red-teaming prompts sourced from real-world data such as social media and news, covering a broad range of ideologically charged scenarios. Our framework categorizes model responses into five danger levels (0–4), enabling a more nuanced analysis of both the frequency and severity of failures. We introduce the interpretable Attack Severity Curve (ASC) to visualize vulnerabilities and compare defense mechanisms across threat intensities. Using XGUARD, we evaluate six popular LLMs and two lightweight defense strategies, revealing key insights into current safety gaps and trade-offs between robustness and expressive freedom. Our work underscores the value of graded safety metrics for building trustworthy LLMs.
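To illustrate the graded (rather than binary) evaluation the abstract describes, here is a sketch of the per-response bookkeeping, assuming labels come from the paper's multi-level human annotation; the `GradedResponse` fields and the summary statistics chosen are hypothetical, not the paper's reported metrics:

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class GradedResponse:
    prompt: str
    response: str
    danger_level: int  # 0 = safe/refusal .. 4 = most severe failure

def summarize(graded):
    """Report both how often a model fails (frequency) and how badly
    (severity), which a single safe/unsafe rate would conflate."""
    levels = [g.danger_level for g in graded]
    return {
        "failure_rate": sum(l > 0 for l in levels) / len(levels),
        "mean_severity": mean(levels),
        "worst_case": max(levels),
    }

# Example with hand-labeled toy data.
sample = [
    GradedResponse("p1", "r1", 0),
    GradedResponse("p2", "r2", 3),
    GradedResponse("p3", "r3", 1),
]
print(summarize(sample))  # failure_rate 0.67, mean_severity 1.33, worst_case 3
```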
Problem

Research questions and friction points this paper is trying to address.

Evaluating extremist content severity in LLM outputs
Developing nuanced safety benchmarks beyond binary labels
Assessing trade-offs between model robustness and expressive freedom
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graded benchmark for extremist content evaluation
Five-level danger scale (0–4) for grading model responses
Attack Severity Curve (ASC) visualizes vulnerabilities across threat intensities