🤖 AI Summary
This study addresses the limitations of existing safety evaluation benchmarks for large language models, which are predominantly English-centric and insufficient for assessing cultural sensitivity and localized harms. The authors construct a cross-cultural safety benchmark encompassing 10 country–language pairs and 5,500 test cases, introducing two novel metrics—Neutral-Safe Rate and Cultural Sensitivity Rate—to distinguish between universal harms and culturally embedded sensitive content. Through a multi-stage construction pipeline involving model-assisted discovery, automated validation, and dual-native annotation, along with a unified evaluation framework, they assess 10 frontier models and 27 localized models. The evaluation reveals a decoupling between jailbreak robustness and cultural awareness in frontier models and demonstrates that the apparent safety of many localized models often stems from generation failures rather than genuine alignment.
📝 Abstract
Current LLM safety benchmarks are predominantly English-centric and often rely on translation, failing to capture country-specific harms. Moreover, they rarely evaluate a model's ability to detect culturally embedded sensitivities as distinct from universal harms. We introduce XL-SafetyBench, a suite of 5,500 test cases across 10 country-language pairs, comprising a Jailbreak Benchmark of country-grounded adversarial prompts and a Cultural Benchmark in which local sensitivities are embedded within innocuous requests. Each item is constructed via a multi-stage pipeline that combines LLM-assisted discovery, automated validation gates, and annotation by two independent native speakers per country. To distinguish principled refusal from comprehension failure, we evaluate Attack Success Rate (ASR) alongside two complementary metrics we introduce: Neutral-Safe Rate (NSR) and Cultural Sensitivity Rate (CSR). Evaluating 10 frontier and 27 local LLMs reveals two key findings. First, jailbreak robustness and cultural awareness are not coupled among frontier models, so a composite safety score obscures per-axis variation. Second, local models exhibit a near-linear ASR-NSR trade-off (r = -0.81), indicating that their apparent safety reflects generation failure rather than genuine alignment. XL-SafetyBench enables more nuanced, cross-cultural safety evaluation in the multilingual era.
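As a reading aid for the reported ASR-NSR relationship, the sketch below shows how a correlation coefficient of this kind could be computed from per-model scores. It assumes that r denotes the Pearson correlation, which the abstract does not state explicitly. The function name `pearson_r` and the score lists are hypothetical placeholders, not data or code from the paper.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-model scores (one entry per model), for illustration only.
# These are NOT results from the paper.
asr_scores = [0.10, 0.25, 0.40, 0.55, 0.70]
nsr_scores = [0.85, 0.70, 0.60, 0.35, 0.30]

r = pearson_r(asr_scores, nsr_scores)
print(f"Pearson r between ASR and NSR: {r:.2f}")
```

A value of r close to -1 means the two scores move in opposite directions along a nearly straight line across models. That is the pattern the abstract describes for local models.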