Beyond English and Evasion: A Human-Annotated Multi-Domain Benchmark for High-Stakes LLM Safety Evaluation in Chinese

📅 2026-05-28
🤖 AI Summary
This work addresses the vulnerability of current large language models to adversarial prompts that exploit Chinese-specific linguistic features, such as pinyin obfuscation, character decomposition, and internet slang, in safety-critical contexts. To this end, the authors introduce ChiSafe-PAS, a Chinese safety evaluation benchmark covering four high-risk domains: self-harm and violence, drug and illicit trade, fraud, and satire. The benchmark contains 1,897 adversarial prompts, of which 1,544 carry complete human annotations: a three-class response label (REFUSE, SAFE-REDIRECT, RESPOND), a nine-category obfuscation taxonomy, a risk-level rating, and annotator rationale. ChiSafe-PAS is presented as a culturally grounded, fine-grained safety assessment resource for high-risk Chinese-language scenarios, intended as a reproducible, high-quality tool for advancing safety alignment of large language models in Chinese contexts.
📝 Abstract
When Large Language Models (LLMs) are deployed in Chinese-language settings, a troubling pattern emerges: safety systems that work well in English break down. These systems struggle to cross linguistic and cultural boundaries, leaving models exposed to adversarial prompts that exploit Chinese-specific evasion techniques, including Pinyin romanization, character decomposition, internet slang, and hedging tone. To address this gap, we introduce ChiSafe-PAS (Chinese Safety Pilot Annotation Set), a human-annotated benchmark of 1,897 adversarial Chinese prompts spanning four high-stakes domains: self-harm and violence, drug and illicit trade, fraud, and satire. Of these, 1,544 entries carry complete gold-standard annotations: a 3-class response label (REFUSE, SAFE-REDIRECT, RESPOND), a nine-category obfuscation taxonomy, a risk-level rating, and annotator rationale. We describe the dataset design, annotation process, and obfuscation taxonomy in detail. Our primary goal is practical: to give the research community a high-quality, culturally grounded resource for benchmarking LLM safety alignment. In doing so, we engage three broader tensions in the field: the blurring boundary between training and evaluation data, the need for domain coverage grounded in real-world risk, and the limits of scale as a substitute for cultural expertise.
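The annotation scheme described in the abstract can be pictured as one record per prompt. The following is a minimal sketch only, not the dataset's actual schema: the class name `AnnotatedPrompt`, the field names, the domain keys, and the helper `gold_subset` are all hypothetical. The abstract names only four of the nine obfuscation categories and does not specify the risk-level scale, so both fields are left as free-form strings rather than guessed enumerations.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class ResponseLabel(Enum):
    # The three response classes named in the abstract.
    REFUSE = "REFUSE"
    SAFE_REDIRECT = "SAFE-REDIRECT"
    RESPOND = "RESPOND"


# The four domains named in the abstract; these string keys are illustrative.
DOMAINS = {"self_harm_violence", "drug_illicit_trade", "fraud", "satire"}


@dataclass
class AnnotatedPrompt:
    """Hypothetical record layout for one benchmark entry."""
    prompt: str
    domain: str
    response_label: Optional[ResponseLabel] = None
    obfuscation: Optional[str] = None  # one of nine categories (full list not assumed here)
    risk_level: Optional[str] = None   # scale is not specified in the abstract
    rationale: Optional[str] = None    # annotator's free-text justification

    def __post_init__(self) -> None:
        # Reject domains outside the four listed in the abstract.
        if self.domain not in DOMAINS:
            raise ValueError(f"unknown domain: {self.domain!r}")

    def is_fully_annotated(self) -> bool:
        # "Complete" here means all four annotation fields are present.
        fields = (self.response_label, self.obfuscation,
                  self.risk_level, self.rationale)
        return all(value is not None for value in fields)


def gold_subset(records: List[AnnotatedPrompt]) -> List[AnnotatedPrompt]:
    """Keep only entries that carry the complete set of annotations."""
    return [record for record in records if record.is_fully_annotated()]
```

Under this layout, filtering the full set of 1,897 prompts with `gold_subset` would correspond to selecting the 1,544 fully annotated entries, assuming completeness is defined as all four annotation fields being present.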
Problem

Research questions and friction points this paper is trying to address.

LLM safety
Chinese adversarial prompts
cultural grounding
high-stakes domains
evasion techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chinese LLM safety
adversarial prompts
human-annotated benchmark
obfuscation taxonomy
cross-lingual alignment