🤖 AI Summary
Existing LLM safety evaluation tools exhibit a strong bias toward high-resource languages, leaving low-resource languages such as Polish without systematic safety assessment. Method: We introduce the first human-annotated Polish-language LLM safety classification benchmark and adversarial perturbation dataset, featuring fine-grained safety categories and diverse attack types. Our approach combines supervised fine-tuning and adversarial testing of HerBERT, Llama-Guard-3-8B, and a Polish-adapted PLLuM model. Results: The lightweight HerBERT classifier achieves significantly higher accuracy than state-of-the-art guard models under both standard and adversarial settings, demonstrating the efficacy of domain-adapted lightweight models. This work fills a critical gap in LLM safety evaluation for low-resource languages and provides a reusable methodology and benchmark infrastructure for non-English safety assessment.
📝 Abstract
Despite increasing efforts to ensure the safety of large language models (LLMs), most existing safety assessments and moderation tools remain heavily biased toward English and other high-resource languages, leaving the majority of global languages underexamined. To address this gap, we introduce a manually annotated benchmark dataset for language model safety classification in Polish. We also create adversarially perturbed variants of these samples designed to challenge model robustness. We conduct a series of experiments to evaluate LLM-based and classifier-based models of varying sizes and architectures. Specifically, we fine-tune three models: Llama-Guard-3-8B, a HerBERT-based classifier (a Polish BERT derivative), and PLLuM, a Polish-adapted Llama-8B model. We train these models using different combinations of annotated data and evaluate their performance, comparing it against publicly available guard models. Results demonstrate that the HerBERT-based classifier achieves the highest overall performance, particularly under adversarial conditions.
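To make the idea of adversarially perturbed samples concrete, here is a minimal sketch of character-level perturbations one might apply to Polish text, such as stripping diacritics or breaking tokenization with inserted spaces. The specific attack types used in the paper are not listed here, so the functions and perturbation choices below are illustrative assumptions, not the authors' method.

```python
# Illustrative sketch (assumption): two simple character-level
# perturbations for Polish text; not the paper's actual attack set.
import random

# Map Polish diacritic characters to their ASCII look-alikes.
DIACRITIC_MAP = str.maketrans("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ", "acelnoszzACELNOSZZ")

def strip_diacritics(text: str) -> str:
    """Replace Polish diacritics with plain ASCII letters,
    preserving surface meaning while changing the token sequence."""
    return text.translate(DIACRITIC_MAP)

def inject_spaces(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Randomly insert spaces after letters to disrupt tokenization."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        out.append(ch)
        if ch.isalpha() and rng.random() < rate:
            out.append(" ")
    return "".join(out)

# Example on a Polish pangram: diacritics are flattened to ASCII.
print(strip_diacritics("Zażółć gęślą jaźń"))  # → Zazolc gesla jazn
```

Perturbations like these preserve readability for humans but can shift a subword tokenizer's segmentation, which is one common way to probe the robustness of safety classifiers.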