🤖 AI Summary
Existing multilingual safety auditing for large language models (LLMs) suffers from two key limitations: narrow language coverage—predominantly restricted to English and Chinese—and coarse-grained safety definitions. To address these, we propose PolyGuard, the first end-to-end multilingual safety guard system supporting 17 languages. Methodologically, it leverages multilingual instruction tuning and hybrid data distillation, integrating real human–AI interaction data with human-verified machine-translated annotations. Our contributions include: (1) releasing PolyGuardMix, a high-quality multilingual safety dataset comprising 1.91 million samples, and PolyGuardPrompts, a benchmark of 29K prompts, both annotated with three fine-grained labels (prompt harmfulness, response harmfulness, and response refusal); and (2) achieving a 5.5% average improvement over state-of-the-art open-source and commercial models across multiple safety and toxicity evaluations, establishing PolyGuard as the current best-performing open-source multilingual safety classifier.
📝 Abstract
Truly multilingual safety moderation efforts for Large Language Models (LLMs) have been hindered by a narrow focus on a small set of languages (e.g., English, Chinese) as well as a limited scope of safety definitions, resulting in significant gaps in moderation capabilities. To bridge these gaps, we release POLYGUARD, a new state-of-the-art multilingual safety model for safeguarding LLM generations, along with the corresponding training and evaluation datasets. POLYGUARD is trained on POLYGUARDMIX, the largest multilingual safety training corpus to date, containing 1.91M samples across 17 languages (e.g., Chinese, Czech, English, Hindi). We also introduce POLYGUARDPROMPTS, a high-quality multilingual benchmark with 29K samples for the evaluation of safety guardrails. Created by combining naturally occurring multilingual human-LLM interactions and human-verified machine translations of an English-only safety dataset (WildGuardMix; Han et al., 2024), our datasets contain prompt-output pairs with labels of prompt harmfulness, response harmfulness, and response refusal. Through extensive evaluations across multiple safety and toxicity benchmarks, we demonstrate that POLYGUARD outperforms existing state-of-the-art open-weight and commercial safety classifiers by 5.5%. Our contributions advance efforts toward safer multilingual LLMs for all global users.
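The three-label annotation scheme described above (prompt harmfulness, response harmfulness, response refusal) can be sketched as a minimal record type. This is an illustrative sketch only: the field names, the `SafetySample` class, and the `is_safe_exchange` helper are assumptions for clarity, not the official dataset schema or model API.

```python
from dataclasses import dataclass

@dataclass
class SafetySample:
    """One prompt-response pair with PolyGuard-style labels.

    Field names are hypothetical, chosen to mirror the three
    label dimensions described in the abstract.
    """
    language: str           # ISO code, e.g. "hi" for Hindi
    prompt: str             # user prompt text
    response: str           # model response text
    prompt_harmful: bool    # is the user prompt harmful?
    response_harmful: bool  # is the model response harmful?
    response_refusal: bool  # did the model refuse to comply?


def is_safe_exchange(sample: SafetySample) -> bool:
    # A conversation turn is safe as long as the response itself is
    # not harmful; refusing a harmful prompt still yields safe output.
    return not sample.response_harmful


# Example: a harmful prompt that the model safely refused.
sample = SafetySample(
    language="hi",
    prompt="<harmful request>",
    response="I can't help with that.",
    prompt_harmful=True,
    response_harmful=False,
    response_refusal=True,
)
```

Separating the three labels lets evaluations distinguish a model that refuses harmful prompts (safe, but with a refusal recorded) from one that answers them harmfully (unsafe), which a single coarse "safe/unsafe" flag cannot capture.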