🤖 AI Summary
Current safety evaluations of large language models (LLMs) rely predominantly on translated benchmarks, which cannot capture how language and geopolitical context jointly shape safety risks. This work proposes ROK-FORTRESS, a bilingual, culturally grounded adversarial benchmark that uses a transcreation matrix to separate the independent and interactive effects of language (English/Korean) and geopolitical entity (U.S./South Korea). The framework pairs each adversarial prompt with a benign counterpart, scores responses against expert-written, prompt-specific binary rubrics, and uses an LLM-as-a-judge protocol for systematic assessment. Findings show a consistent suppression effect for Korean-language prompts, which South Korean geopolitical grounding mitigates in many models. How language and geopolitical grounding interact varies substantially across models, and no model shows significant amplification in the opposite direction.
📝 Abstract
Safety evaluations for large language models (LLMs) increasingly target high-stakes National Security and Public Safety (NSPS) risks. Yet multilingual safety is typically assessed with translation-only benchmarks that preserve the underlying scenario, and empirical evidence of how language and geopolitical context interact remains limited to a narrow set of language pairs. We introduce ROK-FORTRESS (https://huggingface.co/datasets/ScaleAI/ROK-FORTRESS_public), a bilingual, culturally adversarial NSPS benchmark that uses the English–Korean language pair and the U.S.–ROK geopolitical axis as a case study. It separates the effects of language and geopolitical grounding via a transcreation matrix: adversarial intents are evaluated under controlled combinations of (i) English versus Korean language and (ii) U.S. versus Korean entities, institutions, and operational details. Each adversarial prompt is paired with a dual-use benign counterpart to quantify over-refusal. Model responses are then scored by calibrated LLM-as-a-judge panels that apply expert-crafted, prompt-specific binary rubrics.
Across a dual-track set of frontier and Korean-optimized models, we find a consistent suppression effect in Korean-language variants and substantial model-to-model variation in how geopolitical grounding interacts with language. In many models, Korean grounding mitigates the Korean language-driven suppression, and no model shows significant amplification in the other direction. This indicates that, at least in the English–Korean case, safety behavior is shaped by language-as-risk signals and context interactions that translation-only evaluations miss. The transcreation matrix methodology is designed to generalize to other language–culture pairs.
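To make the 2×2 design concrete, the following is a minimal sketch of how scores from the four transcreation cells could be split into a language contrast, a grounding contrast, and their interaction. It is an illustration under stated assumptions, not the authors' analysis: the names `rubric_score`, `cell_means`, and `decompose` are hypothetical, the per-response score is assumed to be the fraction of binary rubric items a judge marked as satisfied, and the toy judgments are invented.

```python
from itertools import product

LANGUAGES = ("en", "ko")    # prompt language
GROUNDINGS = ("us", "rok")  # geopolitical grounding of entities and details


def rubric_score(judgments):
    """Fraction of binary rubric items marked as satisfied (assumed scoring)."""
    return sum(judgments) / len(judgments)


def cell_means(records):
    """Average score per (language, grounding) cell of the 2x2 matrix."""
    cells = {cell: [] for cell in product(LANGUAGES, GROUNDINGS)}
    for language, grounding, score in records:
        cells[(language, grounding)].append(score)
    return {cell: sum(scores) / len(scores) for cell, scores in cells.items()}


def decompose(means):
    """Return language, grounding, and interaction contrasts for a 2x2 table."""
    ko_minus_en_us = means[("ko", "us")] - means[("en", "us")]
    ko_minus_en_rok = means[("ko", "rok")] - means[("en", "rok")]
    rok_minus_us_en = means[("en", "rok")] - means[("en", "us")]
    rok_minus_us_ko = means[("ko", "rok")] - means[("ko", "us")]
    return {
        "language": (ko_minus_en_us + ko_minus_en_rok) / 2,
        "grounding": (rok_minus_us_en + rok_minus_us_ko) / 2,
        # Difference-in-differences: does grounding change the language gap?
        "interaction": ko_minus_en_rok - ko_minus_en_us,
    }


# Toy judgments, invented for illustration only (not results from the paper).
records = [
    ("en", "us", rubric_score([1, 1, 0, 0])),
    ("en", "rok", rubric_score([1, 0, 1, 0])),
    ("ko", "us", rubric_score([1, 0, 0, 0])),
    ("ko", "rok", rubric_score([1, 1, 1, 0, 0, 0, 0, 0])),
]
effects = decompose(cell_means(records))
print(effects)
```

In this toy table the language contrast is negative and the interaction is positive, which is the shape one would expect if Korean grounding partially offsets a Korean-language gap. A translation-only benchmark fills only the two U.S.-grounded cells, so it can estimate one language gap but not the interaction term.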